In [120]:

import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split



In [29]:
### This is where the notebook starts..

In [35]:
## Let's load the training data first.
data = pd.read_csv('names_data_train.csv', index_col=0)

How does the data look?

In [37]:
data.head(5)

Unnamed: 0,name,gender
6528,bharat,m
8411,rakesh alias kalua,m
4964,laxmeena,f
2896,parsanjeet sarkar,m
13046,sangeeta,f


It's a classification problem: we need to classify a name as male or female. But wait, we don't have any numbers, and algorithms only understand numbers.

It's easy to convert gender into numbers: let m --> 1 and f --> 0.

In [43]:
data['gender'] = data.gender.replace(to_replace='m', value=1)
data['gender'] = data.gender.replace(to_replace='f', value=0)

In [67]:
data.head(5)
data.dropna(inplace=True)
print(data.shape)

(22627, 2)


But wait, we usually have numerical features from which we predict the class. Here we only have text. So, we have to extract numerical features from the text. But how?

**More importantly, look at the dataset yourself. What do you think there is in a name that can possibly help classify it into a particular gender?**


Good, now read that sentence again and make sure you grasp it. What do you think is in a name which makes it belong to a particular gender? What does your human instinct say? Can that be replicated here as numbers?

First, let me divide this dataset into train and test sets.

In [68]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
X = data[['name']]
y = data['gender']

In [69]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

In [70]:
print(X_train.shape)
print(y_train.shape)

(18101, 1)
(18101,)


In [71]:
print(X_test.shape)
print(y_test.shape)

(4526, 1)
(4526,)
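
The shapes above follow from simple 80/20 arithmetic: with 22627 rows and `train_size = 0.8`, the train set gets floor(0.8 × 22627) = 18101 rows and the test set gets the remaining 4526. The sketch below illustrates the idea in plain Python. It is not scikit-learn's implementation, and `shuffle_split` and its `seed` argument are hypothetical names introduced here. In the real call, passing `random_state` to `train_test_split` plays the role of the seed and makes the split repeatable.

```python
import random

def shuffle_split(n_rows, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx): a shuffled split of range(n_rows)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # seeded, so the split is repeatable
    n_train = int(n_rows * train_fraction)  # floor of the train share
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = shuffle_split(22627)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, which is what lets the test score stand in for performance on unseen names.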



Now that you have thought of some features, how do we implement them? (You can ask for our help here.)

We will make functions which take the name as input and burp out features.

E.g. a_last_char() --> 1 if 'a' is the last character, else 0

In [74]:
def a_last_char(naam):
    if naam[-1].lower() == 'a':
        return 1 
    else:
        return 0


Then we call **map**, which calls this function on each name, and save the output in a new column.


In [75]:
X_train['f_a_last_char'] = X_train['name'].map(a_last_char)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
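
The warning above most likely appears because `X_train` came out of `train_test_split` as a slice of `data`, so pandas cannot tell whether the new column is meant to show up in the original frame too. One common way to make the intent explicit is to take a copy before adding columns. A minimal sketch on a toy frame (the four names are taken from the data shown earlier; the `.iloc[:3]` slice merely stands in for the train split):

```python
import pandas as pd

def a_last_char(naam):
    # Same feature as above: 1 if the name ends in 'a', else 0.
    return 1 if naam[-1].lower() == 'a' else 0

data = pd.DataFrame({'name': ['bharat', 'laxmeena', 'sangeeta', 'sarita']})

# .copy() gives an independent frame, so adding a column is unambiguous.
X_train = data[['name']].iloc[:3].copy()
X_train['f_a_last_char'] = X_train['name'].map(a_last_char)
print(X_train)
```

The original `data` frame is left untouched; only the copy gains the feature column.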


In [80]:
import numpy as np


In [83]:
np.corrcoef(X_train.f_a_last_char, y_train)[0][1]

-0.36378251952645557
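
`np.corrcoef` returns the matrix of Pearson correlation coefficients, and `[0][1]` picks the entry between the two inputs. As a rough sketch of what that number measures, here is the same formula written out in plain Python on a made-up toy example (the helper name `pearson` and the four-element vectors are illustrative, not taken from the dataset):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

ends_in_a = [1, 1, 0, 0]  # toy feature values
is_male = [0, 0, 1, 1]    # toy labels: every name ending in 'a' is female
r = pearson(ends_in_a, is_male)  # perfectly opposed, so r is -1.0
print(r)
```

A value near -1 or +1 means the feature almost determines the label; a value near 0 means it carries little linear signal.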

A moderate negative correlation: when f_a_last_char is 1, gender tends to be 0.
That means, whenever the last character is 'a', the gender tends to be female.
See, there you go, we made our first feature. Let's feed this into a Logistic Regression.

In [84]:
from sklearn.linear_model import LogisticRegression

In [85]:
clf = LogisticRegression()

In [88]:
clf.fit(X_train.drop('name', axis=1), y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [89]:
clf.score(X_train.drop('name', axis=1), y_train)

0.65123473841224244

65% accuracy on train from a single feature! That is well above the roughly 51% we would get by always guessing the majority class. Some other metrics?



In [90]:
from sklearn.metrics import confusion_matrix

In [91]:
y_pred = clf.predict(X_train.drop('name', axis=1))
confusion_matrix(y_train, y_pred)

array([[3739, 5534],
       [ 779, 8049]])

[https://en.wikipedia.org/wiki/Confusion_matrix](https://en.wikipedia.org/wiki/Confusion_matrix)
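
In scikit-learn's `confusion_matrix`, rows are the true classes and columns the predicted ones, both in sorted label order, so here row 0 is female and row 1 is male. A short sketch of reading a few metrics off the matrix printed above (the variable names are just for illustration):

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = [[3739, 5534],  # true female: predicted female, predicted male
      [779, 8049]]   # true male:   predicted female, predicted male

total = float(sum(sum(row) for row in cm))
accuracy = (cm[0][0] + cm[1][1]) / total                # share of correct predictions
male_recall = cm[1][1] / float(sum(cm[1]))              # males correctly called male
female_recall = cm[0][0] / float(sum(cm[0]))            # females correctly called female
male_precision = cm[1][1] / float(cm[0][1] + cm[1][1])  # predicted males that really are male

print(round(accuracy, 3), round(male_recall, 3),
      round(female_recall, 3), round(male_precision, 3))
```

The model catches about 91% of the male names but only about 40% of the female ones. Since it has a single binary feature, every name that does not end in 'a' is predicted male.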

But how does it perform on unseen data? For that you need to generate similar features in X_test too.

In [117]:
X_test['f_a_last_char'] = X_test['name'].map(a_last_char)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [118]:
clf.score(X_test.drop('name', axis=1), y_test)

0.65267344233318603

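Almost the same as the train score. The dataset is fairly balanced and the model has only one feature, so there is little room to overfit. We did nothing fancy, just followed the basics.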

-----------

What other features does the data show me (as compared to human instinct)?

In [102]:
bench = X_train.copy()
bench['gender']= y_train.copy()

In [112]:
bench['last_two_chars'] = X_train.name.map(lambda x: x.split(' ')[0][-2:])

In [113]:
bench.groupby('gender')['last_two_chars'].describe()

gender        
0       count     9273
        unique     202
        top         ta
        freq       878
1       count     8828
        unique     237
        top         sh
        freq      1029
dtype: object

This shows that the most common ending of men's **first** names is sh (1029 of 8828, about 12%), and of women's first names is ta (878 of 9273, about 9%).

In [115]:
bench.loc[bench.last_two_chars=='ta'].head(5)

Unnamed: 0,name,f_a_last_char,gender,last_two_chars
8414,ghaseeta ram,0,1,ta
14164,sarita,1,0,ta
1351,vinita,1,0,ta
10177,savita,1,0,ta
5507,sweta,1,0,ta


Can you go back up, encode this as a feature, and feed it into the Logistic Regression?
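
One possible way to do that, as a hedged sketch rather than the intended answer: write one small function per ending, following the same pattern as `a_last_char`. The helper names `first_name_ends_with`, `sh_last_two` and `ta_last_two` are hypothetical; the sample names come from the tables above.

```python
def first_name_ends_with(naam, suffix):
    """Return 1 if the first word of the name ends with `suffix`, else 0."""
    first_name = naam.split(' ')[0].lower()
    return 1 if first_name.endswith(suffix) else 0

def sh_last_two(naam):
    return first_name_ends_with(naam, 'sh')

def ta_last_two(naam):
    return first_name_ends_with(naam, 'ta')

for naam in ['rakesh alias kalua', 'sarita', 'ghaseeta ram', 'bharat']:
    print(naam, sh_last_two(naam), ta_last_two(naam))
```

These would be applied like the first feature, for example `X_train['f_ta_last_two'] = X_train['name'].map(ta_last_two)` (the column name is again just a suggestion), and the same columns must be added to `X_test` before scoring.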