## First model with scikit-learn

The scikit-learn estimator API: .fit(X, y), .predict(X), .score(X, y).

How to evaluate the generalization performance of a model with a train-test split.
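
The three calls can be sketched on a tiny synthetic dataset before touching the census data. The arrays `X_toy` and `y_toy` below are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one feature, two well-separated groups.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array(["low", "low", "low", "high", "high", "high"])

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_toy, y_toy)                          # learn from features X and labels y
pred = clf.predict(np.array([[2.5], [10.5]]))  # predict labels for new samples
acc = clf.score(X_toy, y_toy)                  # mean accuracy on the given data
print(pred, acc)
```

Every scikit-learn classifier follows the same pattern, so swapping in another model only changes the line that constructs `clf`.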

In [7]:
#Check working directory
pwd

'/Users/Nguye061/Documents/GitHub/scikit-learn-mooc/datasets'

In [3]:
import pandas as pd

In [9]:
#Load dataset
adult_census = pd.read_csv("adult-census-numeric.csv")

In [10]:
adult_census.head()

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,class
0,41,0,0,92,<=50K
1,48,0,0,40,<=50K
2,60,0,0,25,<=50K
3,37,0,0,45,<=50K
4,73,3273,0,40,<=50K


In [11]:
# Choose a target variable and check it
target_name = "class"
target = adult_census[target_name]  # select this column
target.head(2)

0     <=50K
1     <=50K
Name: class, dtype: object

In [12]:
# Create the feature data by dropping the target column, then check the result
data = adult_census.drop(columns=[target_name])
data.head()  # parentheses mean head is a method: a function attached to the object that performs an action

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week
0,41,0,0,92
1,48,0,0,40
2,60,0,0,25
3,37,0,0,45
4,73,3273,0,40


In [14]:
data.shape  # attribute: a stored property of the DataFrame, accessed without parentheses

(39073, 4)
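
The difference between a method and an attribute, and the fact that `drop` returns a new DataFrame rather than modifying the original, can be checked on a small frame. The frame `frame` below is a hypothetical stand-in, not the census data.

```python
import pandas as pd

# Hypothetical three-row frame with one feature and the target column.
frame = pd.DataFrame({"age": [41, 48, 60], "class": ["<=50K", "<=50K", ">50K"]})

features = frame.drop(columns=["class"])  # method: called with parentheses, returns a new DataFrame
n_rows, n_cols = features.shape           # attribute: no parentheses, just a stored value

print(n_rows, n_cols)
print("class" in frame.columns)           # the original frame still has the target column
```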

### Fit the first model

Method 1: K-nearest neighbors

In [16]:
from sklearn.neighbors import KNeighborsClassifier

In [38]:
# TRAIN: choose the model, then fit it on the data (X) and the target (y)
model = KNeighborsClassifier()
# Fitting on the training data and target produces the model state (what the model has learned is stored in it)
model.fit(data, target)

In [39]:
# PREDICT: the fitted model state is used to predict labels for the given data
# (here the same data the model was trained on, not a separate test set)
target_predicted = model.predict(data)

In [40]:
#Check top 5 of predicted outcome
target_predicted[:5]

array([' >50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K'], dtype=object)

In [41]:
# Then compare with the true data
target[:5]

0     <=50K
1     <=50K
2     <=50K
3     <=50K
4     <=50K
Name: class, dtype: object

In [42]:
#Check for accuracy (average of True and False)
(target == target_predicted).mean()

0.8219486602001382
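
Taking the mean of a boolean comparison works because `True` counts as 1 and `False` as 0. A minimal sketch with made-up labels, cross-checked against `accuracy_score` from `sklearn.metrics`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
y_true = np.array(["<=50K", "<=50K", ">50K", "<=50K", ">50K"])
y_pred = np.array(["<=50K", ">50K", ">50K", "<=50K", "<=50K"])

matches = (y_true == y_pred)       # element-wise comparison: 3 matches out of 5
manual_accuracy = matches.mean()   # fraction of correct predictions
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```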

### Train-test-split

In [43]:
adult_census_test = pd.read_csv('adult-census-numeric-test.csv')

In [44]:
#Set target variable
target_test = adult_census_test[target_name]
data_test = adult_census_test.drop(columns = [target_name])

In [45]:
data_test.shape

(9769, 4)

In [46]:
#Check accuracy
accuracy = model.score(data_test,target_test)
accuracy

0.80202681953117

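The accuracy on the training data (about 0.822) is slightly higher than on the test data (about 0.802). Scoring a model on the data it was trained on gives an optimistic estimate, so the held-out test score is the better measure of generalization; the small gap points to mild overfitting.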
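
Why the training score is optimistic is easiest to see in the extreme case of a single neighbor, where each training point is its own nearest neighbor. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the census data), with some label noise added through `flip_y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with noisy labels.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# With n_neighbors=1 the model effectively memorizes the training set.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
train_acc = memorizer.score(X_tr, y_tr)
test_acc = memorizer.score(X_te, y_te)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

A perfect training score here says nothing about new samples, which is why the comparison that matters is the score on data the model has not seen.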

### Train-test-split with K-nearest neighbors with n_neighbors = 50, same datasets

In [47]:
# TRAIN: choose the model, then fit it on the data (X) and the target (y)
model = KNeighborsClassifier(n_neighbors=50)
# Fitting on the training data and target produces the model state (what the model has learned is stored in it)
model.fit(data, target)

In [48]:
# PREDICT: the fitted model state is used to predict labels for the given data
# (here the same data the model was trained on, not a separate test set)
target_predicted = model.predict(data)

In [49]:
#Check top 10 of predicted outcome
target_predicted[:10]

array([' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K',
       ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)

In [50]:
#Check for accuracy (average of True and False)
(target == target_predicted).mean()

0.8290635477183733

In [51]:
adult_census_test = pd.read_csv('adult-census-numeric-test.csv')

In [52]:
#Set target variable
target_test = adult_census_test[target_name]
data_test = adult_census_test.drop(columns = [target_name])

In [53]:
#Check accuracy
accuracy = model.score(data_test,target_test)
accuracy

0.8194288054048521
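
With 50 neighbors the test accuracy in this run rose from about 0.802 to about 0.819. Rather than editing and re-running the cells for each value, the comparison can be sketched as a loop. The data below is synthetic (an assumption for illustration), so the numbers it prints will differ from the census results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the census data.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Map each n_neighbors value to its (train accuracy, test accuracy) pair.
results = {}
for k in (1, 5, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    results[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))

for k, (train_acc, test_acc) in results.items():
    print(f"n_neighbors={k:>2}: train={train_acc:.3f}, test={test_acc:.3f}")
```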

## Working with numerical data

In [55]:
adult_census = pd.read_csv('adult-census.csv')

In [59]:
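# Drop education-num, which duplicates the education column in numeric form.
# Re-running this cell once the column is gone raises the KeyError below;
# passing errors="ignore" to drop() makes the cell safe to re-run.
adult_census = adult_census.drop(columns='education-num')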

KeyError: "['education-num'] not found in axis"

In [58]:
adult_census

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [60]:
data = adult_census.drop(columns = 'class')
target = adult_census['class']

In [61]:
data.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

In [62]:
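# Note: info is a method. Without parentheses, as here, the cell shows the bound method
# (and a repr of the frame) instead of the summary; call data.info() to print the summary.
data.info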

<bound method DataFrame.info of        age      workclass      education       marital-status  \
0       25        Private           11th        Never-married   
1       38        Private        HS-grad   Married-civ-spouse   
2       28      Local-gov     Assoc-acdm   Married-civ-spouse   
3       44        Private   Some-college   Married-civ-spouse   
4       18              ?   Some-college        Never-married   
...    ...            ...            ...                  ...   
48837   27        Private     Assoc-acdm   Married-civ-spouse   
48838   40        Private        HS-grad   Married-civ-spouse   
48839   58        Private        HS-grad              Widowed   
48840   22        Private        HS-grad        Never-married   
48841   52   Self-emp-inc        HS-grad   Married-civ-spouse   

               occupation relationship    race      sex  capital-gain  \
0       Machine-op-inspct    Own-child   Black     Male             0   
1         Farming-fishing      Husband   

In [63]:
numerical_columns = ['age','capital-gain','capital-loss','hours-per-week']
data_numeric = data[numerical_columns]
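
Instead of listing the numerical columns by hand, they can also be selected by dtype. This is a minimal sketch on a hypothetical two-row frame (`df_toy`), using pandas' `select_dtypes`.

```python
import pandas as pd

# Hypothetical miniature of the census frame: two numeric columns, one text column.
df_toy = pd.DataFrame({
    "age": [25, 38],
    "workclass": ["Private", "Local-gov"],
    "hours-per-week": [40, 50],
})

# Keep only columns with a numeric dtype (int64, float64, ...).
numeric_toy = df_toy.select_dtypes(include="number")
print(list(numeric_toy.columns))
```

Selecting by dtype also picks up integer-coded categories, so it is worth checking the result against the columns that are numerical in meaning, not just in storage.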

### Making a train-test split

In [64]:
from sklearn.model_selection import train_test_split

In [66]:
# 75-25 split, returned in the order: data_train, data_test, target_train, target_test
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, target, random_state=42, test_size=0.25)

In [67]:
#Check the features of training data
data_train.shape

(36631, 4)

In [68]:
#Check the features of test data
data_test.shape

(12211, 4)
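
The two shapes follow from the split size: with a fractional `test_size`, scikit-learn rounds the test count up and gives the remaining rows to the training set. A small sketch on made-up arrays (`X_demo`, `y_demo`) shows the same arithmetic:

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 10
X_demo = np.arange(n_samples).reshape(-1, 1)  # made-up single feature
y_demo = np.arange(n_samples) % 2             # made-up binary target

X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

expected_test = math.ceil(0.25 * n_samples)   # 2.5 rounds up to 3
print(len(X_a), len(X_b), expected_test)

# The same arithmetic for the 48842 census rows gives the test size shown above.
print(math.ceil(0.25 * 48842))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the same split on every run.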

#### Train a Logistic regression model

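Logistic regression learns a linear decision rule. For example: if 0.1 * age + 3.3 * hours-per-week - 15.1 > 0, predict 'rich' (>50K); otherwise predict 'poor' (<=50K). The coefficients here are illustrative only.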

In [70]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [71]:
model.fit(data_train, target_train)

In [72]:
model.score(data_test, target_test)

0.8070592089099992
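
The linear rule behind a fitted logistic regression can be reproduced by hand from its `coef_` and `intercept_` attributes. The sketch below uses synthetic two-feature data and the labels 'rich' and 'poor' as assumptions for illustration; it is not fitted on the census data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: two features, labels decided by a known linear rule.
rng = np.random.default_rng(0)
X_lin = rng.normal(size=(200, 2))
y_lin = np.where(X_lin[:, 0] + 2 * X_lin[:, 1] > 0, "rich", "poor")

logreg = LogisticRegression().fit(X_lin, y_lin)

# Weighted sum of the features plus the intercept, one value per sample.
linear_score = X_lin @ logreg.coef_[0] + logreg.intercept_[0]

# A positive score maps to the second class in classes_, otherwise to the first.
by_hand = np.where(linear_score > 0, logreg.classes_[1], logreg.classes_[0])

print(logreg.classes_)
print(np.array_equal(by_hand, logreg.predict(X_lin)))
```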

### Exercise

In [74]:
# Import the dummy classifier (default strategy = 'prior': always predicts the most frequent class seen during training)
from sklearn.dummy import DummyClassifier
model = DummyClassifier()

In [75]:
model.fit(data_train,target_train)

In [76]:
model.score(data_test, target_test)

0.7660306281221849
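
A dummy classifier ignores the features, so its accuracy is simply the share of test samples that belong to the class it always predicts. A minimal sketch with made-up labels (the arrays below are assumptions for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: "<=50K" is the majority class in the training set.
y_train_toy = np.array(["<=50K"] * 7 + [">50K"] * 3)
y_test_toy = np.array(["<=50K"] * 3 + [">50K"] * 2)

# The features are placeholders; the dummy classifier never looks at them.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="prior").fit(X_train_toy, y_train_toy)
baseline_acc = baseline.score(X_test_toy, y_test_toy)
majority_share = (y_test_toy == "<=50K").mean()

print(baseline_acc, majority_share)
```

Read this way, the 0.766 above is the bar any real model has to clear: the logistic regression's 0.807 is a gain of about four percentage points over always predicting the majority class.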