<a href="https://colab.research.google.com/github/oneryigit/for_my_reference/blob/main/Built_in_STATA_and_R_Datasets_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading built-in `STATA` and `R` datasets within `Python` 

- `Statsmodels` can load the example datasets that come with `STATA` and `R`. They are downloaded on demand, so an internet connection is needed.

- Most fellow Pythonistas already know that they can load datasets from `Seaborn` or `Sklearn`.


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm



### Load Built-in Datasets from STATA

In [2]:

sm.datasets.webuse('auto').head()

# Other STATA datasets are listed at: https://www.stata-press.com/data/r17/

# Pass the dataset name without the .dta extension.

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
0,AMC Concord,4099,22,3.0,2.5,11,2930,186,40,121,3.58,Domestic
1,AMC Pacer,4749,17,3.0,3.0,11,3350,173,40,258,2.53,Domestic
2,AMC Spirit,3799,22,,3.0,12,2640,168,35,121,3.08,Domestic
3,Buick Century,4816,20,3.0,4.5,16,3250,196,40,196,2.93,Domestic
4,Buick Electra,7827,15,4.0,4.0,20,4080,222,43,350,2.41,Domestic
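A note on the link above: `webuse` accepts a `baseurl` argument, and as far as I know its default points at an older release folder (`r11`) rather than the `r17` listing. If a dataset from the r17 page does not load, passing the folder explicitly may help. The sketch below is only an illustration: `R17_BASE`, `stata_dataset_url` and `load_stata_dataset` are hypothetical names, and the assumption is that each dataset lives at `<baseurl><name>.dta`.

```python
from urllib.parse import urljoin

# Assumed release folder; the notebook links to it as a catalogue of datasets.
R17_BASE = "https://www.stata-press.com/data/r17/"


def stata_dataset_url(name, baseurl=R17_BASE):
    """Build the .dta URL for a dataset name given without its extension."""
    return urljoin(baseurl, name + ".dta")


def load_stata_dataset(name, baseurl=R17_BASE):
    """Download a STATA example dataset from a specific release (needs network)."""
    # Imported lazily so the URL helper above works without statsmodels installed.
    import statsmodels.api as sm

    return sm.datasets.webuse(name, baseurl=baseurl)


print(stata_dataset_url("auto"))
# → https://www.stata-press.com/data/r17/auto.dta
```

With a network connection, `load_stata_dataset('auto').head()` should give the same kind of DataFrame as the cell above.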


### Load Built-in Datasets from R

In [3]:
sm.datasets.get_rdataset(dataname='mtcars', package='datasets').data.head()

# List of datasets: https://vincentarelbundock.github.io/Rdatasets/datasets.html

# The default package is 'datasets', so you do not have to type the package.

# But you can specify a package and get the data from there. See the example below.

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [4]:
# Another example from R, this time with a different package.

# Do not forget the .data at the end: get_rdataset returns a dataset object, and .data is the DataFrame.

sm.datasets.get_rdataset(dataname='Election16', package='Stat2Data').data.head()

Unnamed: 0,State,Abr,Income,HS,BA,Adv,Dem.Rep,TrumpWin
0,Alabama,AL,43623,84.3,23.5,8.7,-17,1
1,Alaska,AK,72515,92.1,28.0,10.1,-17,1
2,Arizona,AZ,50255,86.0,27.5,10.2,-1,1
3,Arkansas,AR,41371,84.8,21.1,7.5,-7,1
4,California,CA,61818,81.8,31.4,11.6,16,0
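If you load the same R datasets again and again, `get_rdataset` also takes a `cache` argument that keeps a local copy after the first download. Below is a minimal sketch that loads several datasets into a dict of DataFrames. `load_r_frames` is a hypothetical helper, and its `fetch` parameter is a hypothetical hook added only so the function can be tried without a network connection.

```python
def load_r_frames(specs, fetch=None):
    """Load several R datasets and return {dataname: DataFrame}.

    specs is an iterable of (dataname, package) pairs.
    fetch defaults to statsmodels' get_rdataset; pass a stand-in to test offline.
    """
    if fetch is None:
        # Imported lazily so the helper can be defined without statsmodels.
        import statsmodels.api as sm

        fetch = sm.datasets.get_rdataset

    frames = {}
    for dataname, package in specs:
        # cache=True asks statsmodels to keep a local copy of the download.
        dataset = fetch(dataname, package, cache=True)
        # The returned object wraps the table; .data is the DataFrame itself.
        frames[dataname] = dataset.data
    return frames
```

With a network connection, `load_r_frames([('mtcars', 'datasets'), ('Election16', 'Stat2Data')])` should return the two DataFrames shown above.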


### Load Datasets from Seaborn

In [5]:
# Well, those who are familiar with Python already know this :) 

sns.load_dataset('mpg').head()

# See other datasets: https://github.com/mwaskom/seaborn-data

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


### Load Datasets from Sklearn

In [6]:
from sklearn.datasets import load_iris

In [7]:
df = load_iris()  # returns a dict-like Bunch of NumPy arrays, not a DataFrame

df.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [8]:
df.data[:10]  # show the first 10 rows

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

In [9]:
df.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [10]:
df.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [11]:

# If you work on ML, the array format that sklearn returns is usually what you need.
# If you want a DataFrame, the other options are more convenient.
# Of course, you can always convert the sklearn arrays to a pandas DataFrame,
# but what is the point? Just load from the others :)
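
For the case where the sklearn copy is the one you want as a DataFrame, here is a minimal sketch of the conversion mentioned above. It assumes scikit-learn 0.23 or newer, where `load_iris` accepts `as_frame=True`. The names `iris` and `iris_df` and the `species` column are made up for this example.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
# (assumes scikit-learn 0.23 or newer).
iris = load_iris(as_frame=True)

# iris.frame holds the four feature columns plus an integer 'target' column.
iris_df = iris.frame.copy()

# Map the integer codes (0, 1, 2) to the species names in target_names.
iris_df["species"] = iris_df["target"].map(dict(enumerate(iris.target_names)))

print(iris_df.head())
```

On older scikit-learn versions, `pd.DataFrame(df.data, columns=df.feature_names)` applied to the `df` loaded earlier does the same job for the feature columns.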