## Scikit-learn
Another Python module that is very useful for data analysis and exploration is scikit-learn. This module contains statistical, machine learning, and related methods, including different types of regression and clustering.  
In this notebook, you can use scikit-learn to explore the Titanic data we've worked on this week.

In [1]:
# first, let's import scikit-learn and our dataset
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import r2_score

# import titanic_cleaned.csv using pandas here

One of the most basic ways of analyzing your data is regression. Scikit-learn has a pretty simple linear regression class, `linear_model.LinearRegression()`.

In [8]:
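titanic = pd.read_csv('titanic_cleaned.csv')  # load the cleaned Titanic data
titanic.head()
# recode Sex as a number so it can be used as a predictor: male = 1, female = 0
titanic.Sex = titanic.Sex.replace(['male','female'], [1,0])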

In [9]:
X = titanic[['Age', 'Pclass']]  # this syntax lets us select multiple columns from the dataframe
y = titanic['Fare']

model = linear_model.LinearRegression()  # instantiate a linear model
model.fit(X, y)  # fit the model betas
y_predicted = model.predict(X)  # predict y from X
r2 = r2_score(y, y_predicted)  # calculate model R-squared
print(f'R-squared is {r2}')

X = titanic[['Age', 'Sex', 'Pclass']]  # this syntax lets us select multiple columns from the dataframe
y = titanic['Fare']

model = linear_model.LinearRegression()  # instantiate a linear model
model.fit(X, y)  # fit the model betas
y_predicted = model.predict(X)  # predict y from X
r2 = r2_score(y, y_predicted)  # calculate model R-squared
print(f'R-squared is {r2}')

R-squared is 0.30983110915975454
R-squared is 0.31949028435207527


As you can see, the R-squared isn't super great: Age and Pclass explain only about 31% of the variance in Fare, and adding Sex raises that to about 32%. Can you add some more predictors?
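
Before adding predictors, it can help to see what a fitted model exposes. The following is a minimal sketch on made-up data (the `toy` dataframe and its columns `x1`, `x2`, `y` are assumptions for illustration, not part of the Titanic dataset). It shows where the fitted betas live and that `score()` gives the same R-squared as `r2_score`.

```python
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import r2_score

# Made-up data with a known relationship: y = 3*x1 - 2*x2 + 5 (no noise)
toy = pd.DataFrame({'x1': [0.0, 1.0, 2.0, 3.0, 4.0],
                    'x2': [1.0, 0.0, 2.0, 1.0, 3.0]})
toy['y'] = 3 * toy['x1'] - 2 * toy['x2'] + 5

toy_model = linear_model.LinearRegression()
toy_model.fit(toy[['x1', 'x2']], toy['y'])

# coef_ holds one beta per predictor, in the same order as the columns of X
for name, beta in zip(['x1', 'x2'], toy_model.coef_):
    print(f'{name}: {beta:.2f}')
print(f'intercept: {toy_model.intercept_:.2f}')

# score() returns the R-squared of the model on the data you pass in
toy_r2 = toy_model.score(toy[['x1', 'x2']], toy['y'])
print(f'R-squared is {toy_r2}')
```

On the Titanic data, the same attributes on `model` show which predictors carry the most weight, although betas are only directly comparable when the predictors are on similar scales.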

In [None]:
# think about which predictors might be most informative and add them to X, then run the model again
# then print the R-squared with r2_score and compare it to the values above

Of course, the really interesting thing to predict would be survival. This is a binary variable, so we want to use logistic regression.  
The logistic regression method in sklearn is conveniently called `linear_model.LogisticRegression()`. Use it now to construct a good model of survival.  

__Hint:__ Use `pd.get_dummies()` to turn categorical columns into dummy coded columns.  
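
As a rough illustration of the hint, here is a small sketch of dummy coding on made-up data. The `toy` dataframe and its `Embarked` column are assumptions for the example; check which categorical columns your cleaned Titanic file actually contains.

```python
import pandas as pd

# Made-up rows standing in for the Titanic table
toy = pd.DataFrame({
    'Embarked': ['S', 'C', 'Q', 'S'],
    'Fare': [7.25, 71.28, 8.46, 8.05],
})

# Each category becomes its own 0/1 column.
# drop_first=True drops one category so the dummy columns are not perfectly collinear.
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', drop_first=True)

# Combine the numeric predictor with the dummy columns into one predictor table
X_toy = pd.concat([toy[['Fare']], dummies.astype(int)], axis=1)
print(X_toy)
```

The dropped category (`C` here) becomes the reference level: a row with zeros in every dummy column belongs to it.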

In [6]:
# construct your own X variable with predictors here
X = titanic[['Fare']]  # double brackets keep X two-dimensional; add more predictor columns to this list
y = pd.get_dummies(titanic['Survived'])['yes']  # dummy-code survival: 1 = survived, 0 = did not survive
print(y)

# same steps as the linear regression code above, but with the model type replaced by logistic regression
model = linear_model.LogisticRegression()
model.fit(X, y)  # fit the model betas
y_predicted = model.predict(X)  # predict the survival class (0 or 1) from X
r2 = r2_score(y, y_predicted)  # R-squared of the predicted classes; negative values are possible
print(f'R-squared is {r2}')

0      0
1      1
2      1
3      1
4      0
5      0
6      0
7      0
8      1
9      1
10     1
11     1
12     0
13     0
14     0
15     1
16     0
17     1
18     0
19     1
20     0
21     1
22     1
23     1
24     0
25     1
26     0
27     0
28     1
29     0
      ..
859    0
860    1
861    0
862    0
863    1
864    1
865    0
866    0
867    1
868    0
869    1
870    0
871    0
872    1
873    1
874    0
875    0
876    0
877    1
878    1
879    0
880    0
881    0
882    0
883    0
884    0
885    1
886    0
887    1
888    0
Name: yes, Length: 889, dtype: uint8
R-squared is -0.41451301832208287


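A negative R-squared means the predicted classes fit worse than simply predicting the mean survival rate for everyone. Turns out it's still really hard to predict who survived, but remember: Doing Titanic stats > watching the movie.  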
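
R-squared is designed for continuous outcomes, so for a yes/no outcome the fraction of correct predictions is often easier to interpret. The sketch below uses synthetic data (the names `X_demo` and `y_demo` and the way they are generated are assumptions for illustration) to show `accuracy_score` next to a simple majority-class baseline.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import accuracy_score

# Synthetic binary outcome driven by two made-up predictors plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

clf = linear_model.LogisticRegression()
clf.fit(X_demo, y_demo)
y_pred = clf.predict(X_demo)

# accuracy: fraction of rows where the predicted class matches the true class
accuracy = accuracy_score(y_demo, y_pred)

# baseline: accuracy you would get by always predicting the more common class
baseline = max(y_demo.mean(), 1 - y_demo.mean())

print(f'Accuracy is {accuracy:.3f}, majority-class baseline is {baseline:.3f}')
```

A model is only informative if its accuracy clearly beats the baseline. Note that both numbers here are computed on the training data, so they are optimistic.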

Now explore the scikit-learn documentation to see what learning algorithms are in there. If you have time, play around with K-means clustering, nearest-neighbor classifiers, or Naive Bayes classifiers on this dataset. You might be able to get better predictions for the Titanic data than with simple regression models.
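
As a possible starting point, here is a sketch that compares two of those classifiers with 5-fold cross-validation. It runs on synthetic data from `make_classification` (an assumption for illustration); to try it on the Titanic data, swap in your own `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the Titanic predictors and outcome
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

candidates = {
    'k-nearest neighbors': KNeighborsClassifier(n_neighbors=5),
    'Gaussian Naive Bayes': GaussianNB(),
}

# cross_val_score fits each model on 4/5 of the data and scores it on the held-out 1/5,
# so the mean accuracy reflects performance on rows the model has not seen
scores = {}
for name, clf in candidates.items():
    scores[name] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()
    print(f'{name}: mean accuracy {scores[name]:.3f}')
```

Cross-validated scores are a fairer basis for comparing models than scores computed on the same rows used for fitting.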