## Scikit-learn
Another Python module that is very useful for data analysis and exploration is scikit-learn. This module contains statistical, machine learning, and related methods, including different types of regression and clustering.  
In this notebook, you can use scikit-learn to explore the Titanic data we've worked on this week.

In [1]:
# first, let's import scikit-learn and our dataset
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import r2_score

# import titanic_cleaned.csv using pandas here
titanic = pd.read_csv('titanic_cleaned.csv')

One of the most basic ways of analyzing your data is regression. Scikit-learn has a fairly simple linear regression class, `linear_model.LinearRegression`.

In [2]:
X = titanic[['Age', 'Pclass']]  # this syntax lets us select multiple columns from the dataframe
y = titanic['Fare']

model = linear_model.LinearRegression()  # instantiate a linear model
model.fit(X, y)  # fit the model betas
y_predicted = model.predict(X)  # predict y from X
r2 = r2_score(y, y_predicted)  # calculate model R-squared
print(f'R-squared is {r2}')

R-squared is 0.30983110915975454


In [32]:
print(model.coef_)
print(model.intercept_)
titanic.head()

[ -0.391521   -34.63399298]
123.76608305402485


Unnamed: 0.1,Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,ParCh,Ticket,Fare,Embarked
0,0,1,no,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,1,2,yes,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,2,3,yes,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,3,4,yes,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,4,5,no,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S
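
The coefficients above are printed in the same order as the columns of `X`, so the first value (about -0.39) belongs to `Age` and the second (about -34.6) to `Pclass`. Pairing each coefficient with its column name makes this easier to read. The sketch below uses a small made-up dataset rather than `titanic_cleaned.csv`, so the toy values and the betas used to generate them are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# toy data standing in for the Titanic columns (all values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=200),
    'Pclass': rng.integers(1, 4, size=200),
})
# build Fare from known betas plus noise so the fit can be checked by eye
toy['Fare'] = 120 - 0.4 * toy['Age'] - 35 * toy['Pclass'] + rng.normal(0, 5, size=200)

X = toy[['Age', 'Pclass']]
y = toy['Fare']
model = linear_model.LinearRegression().fit(X, y)

# label each coefficient with the column it belongs to
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs)
print(f'intercept: {model.intercept_:.2f}')
```

Each coefficient is the change in predicted fare for a one-unit increase in that predictor with the others held fixed, so a negative `Pclass` coefficient means that fares drop as the class number goes up.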


That is a lot of information, but as you can see, an R-squared of about 0.31 isn't great: age and class together explain less than a third of the variance in fare. Can you add some more predictors?

In [37]:
# think about which predictors might be most informative and add them to X, then run the model again
X = titanic[['Age', 'Pclass', 'SibSp', 'PassengerId', 'ParCh']]  # this syntax lets us select multiple columns from the dataframe

model = linear_model.LinearRegression()  # instantiate a linear model
model.fit(X, y)  # fit the model betas

# then compute the R-squared from the model's predictions
y_predicted = model.predict(X)  # predict y from X
r2 = r2_score(y, y_predicted)  # calculate model R-squared
print(f'R-squared is {r2}')

print(model.coef_)
print(model.intercept_)

R-squared is 0.36854144307014236
[-1.54131876e-01 -3.42183641e+01  5.83703905e+00  2.65043182e-04
  1.02441371e+01]
108.67012815316474


Of course the really interesting thing to predict would be survival. This is a binary variable, so we want to use logistic regression.  
The logistic regression class in sklearn is conveniently called `linear_model.LogisticRegression()`. Use it now to construct a good model of survival.  

__Hint:__ Use `pd.get_dummies()` to turn categorical columns into dummy coded columns.  
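
As a quick illustration of the hint, here is a minimal sketch of what `pd.get_dummies()` returns. The toy dataframe is made up, and `drop_first=True` is shown as one optional way to avoid a redundant column.

```python
import pandas as pd

# made-up rows that mimic two categorical Titanic columns
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male'],
                    'Embarked': ['S', 'C', 'S', 'Q']})

# get_dummies creates one indicator column per category
sex_dummies = pd.get_dummies(toy['Sex'])
print(sex_dummies)

# for a two-level variable, a single indicator column is enough
toy['SexBin'] = sex_dummies['male'].astype(int)

# with more than two levels, drop_first=True drops the first category
# so the remaining columns are not redundant with each other
embarked = pd.get_dummies(toy['Embarked'], prefix='Embarked', drop_first=True)
print(toy.join(embarked))
```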

In [48]:
y = pd.get_dummies(titanic['Survived'])['yes']
# construct your own X variable with predictors here
titanic['SexBin'] = pd.get_dummies(titanic['Sex'])['male']
X = titanic[['Pclass', 'SexBin', 'ParCh', 'Age', 'Fare']]

# fit the model with the same steps as the linear regression above, swapping in logistic regression
model2 = linear_model.LogisticRegression()
model2.fit(X, y)
y_predicted = model2.predict(X)
r2 = r2_score(y, y_predicted)

# and print the R-squared
print(f'R-sq is {r2}')
print(model2.coef_)

R-sq is 0.12366870245365913
[[-0.87221732 -2.45701559 -0.19501741 -0.02548508  0.00364616]]


Turns out it's still really hard to predict who survived, but remember: Doing Titanic stats > watching the movie.  
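
One caveat: R-squared is designed for continuous targets, and it can look low on 0/1 predictions even when most cases are classified correctly. For a classifier, accuracy on held-out data is a more common check. The sketch below is one way to do that, using made-up data (the survival rule, sample size, and split are all assumptions) rather than the Titanic file.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# made-up passengers: class 1-3 and a 0/1 sex indicator
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=n),
    'SexBin': rng.integers(0, 2, size=n),
})
# hypothetical rule: survival odds fall with class number and for SexBin == 1
logit = 2.5 - 0.9 * toy['Pclass'] - 2.4 * toy['SexBin']
prob = 1 / (1 + np.exp(-logit))
toy['Survived'] = (rng.uniform(size=n) < prob).astype(int)

X = toy[['Pclass', 'SexBin']]
y = toy['Survived']

# hold out a quarter of the rows so the score is not computed on training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = linear_model.LogisticRegression()
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test))
# accuracy of always guessing the more common outcome, for comparison
baseline = max(y_test.mean(), 1 - y_test.mean())
print(f'accuracy: {acc:.2f} (majority-class baseline: {baseline:.2f})')
```

Comparing against the majority-class baseline shows how much the predictors add beyond simply guessing the most frequent outcome.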

Now explore the scikit-learn documentation to see what learning algorithms are in there. If you have time, play around with K-means clustering, nearest-neighbor classifiers, or Naive Bayes classifiers on this dataset. You might be able to get better predictions for the Titanic data than with simple regression models.
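
As a starting point, here is a minimal sketch that compares two of these classifiers with 5-fold cross-validation. It uses a synthetic dataset from `make_classification` as a stand-in, so you would replace `X` and `y` with your own Titanic predictors and survival column. The choice of 5 neighbors and 5 folds is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in data: 300 rows, 5 numeric predictors, binary target
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=3, random_state=0)

candidates = {
    'k-nearest neighbors': KNeighborsClassifier(n_neighbors=5),
    'Gaussian naive Bayes': GaussianNB(),
}

# mean accuracy across 5 cross-validation folds for each classifier
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(clf, X, y, cv=5).mean()
    print(f'{name}: mean accuracy {scores[name]:.2f}')
```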