## Lesson outline :
* Basic : (35 mins)

    * import and explore data
    * data visualization
    * data manipulation
    * data cleaning & handling missing values
    
    
* ML process : (35 mins)

    * train/test split
    * train a model
    * test your model
    * cross validation
    * model selection
    * model tuning
    * feature engineering (not in this session)
    * test your production model
    
 
* HW guided tour (20 mins)


In [None]:
import pandas as pd
import sklearn.ensemble
import seaborn as sns
# allow plots to appear within the notebook
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
#load the data
titanic = pd.read_csv('titanic.csv', header=0)


In [None]:
#display the first 5 samples
titanic.head()

In [None]:
#display the last 5 samples
titanic.tail()

In [None]:
#numeric values statistics
titanic.describe()

In [None]:
#data information
titanic.info()
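
`info()` reports non-null counts per column, which is the first hint about missing values. A minimal sketch of counting them directly, assuming a small hand-made frame (`toy` and its values are invented for illustration, not taken from titanic.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some holes in it
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 7.92],
})

# isnull() marks every missing cell as True; sum() counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

The same call on `titanic` would show which columns need cleaning before modeling.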

#### What can we conclude from looking at the data ?

## Seaborn
##### note : you can also use matplotlib / pandas etc.

In [None]:
# in %
sns.barplot(x="Sex",y="Survived",data=titanic)

In [None]:
#count
sns.countplot(x="Embarked",data=titanic)

In [None]:
#alternative
titanic.Embarked.value_counts().plot(kind='bar') #try also kind='barh'

In [None]:
# use color as another dimension
sns.barplot(x="Sex",y="Survived",hue = "Pclass",data=titanic)

### Exercise : plot barplot to describe survival rate vs. Embarked & Sex

### other useful functions :
df.corr() - Compute pairwise correlation of columns, excluding NA/null values

pd.crosstab(col1,col2) - computes a frequency table of two features
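
A minimal sketch of both functions, assuming a tiny hand-made frame (`toy` and its values are invented for illustration, not taken from the Titanic data):

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Pclass": [3, 1, 3, 2, 1],
    "Survived": [0, 1, 1, 0, 1],
})

# Pairwise correlation, computed on the numeric columns only
corr = toy[["Pclass", "Survived"]].corr()

# Frequency table: number of rows for each (Sex, Survived) combination
counts = pd.crosstab(toy.Sex, toy.Survived)

print(corr)
print(counts)
```

Both results are DataFrames, so they can be passed straight to `sns.heatmap`.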

In [None]:
#fig = plt.figure()
#fig.set_size_inches(10,5)

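# keep only the numeric columns; recent pandas versions raise an error on text columns
sns.heatmap(titanic.select_dtypes('number').corr())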

In [None]:
fig = plt.figure()
fig.set_size_inches(6,4)
sns.heatmap(pd.crosstab(titanic.Embarked,titanic.Pclass))

## Exercise 
* load the chicago crimes dataset (ward42.csv)
* check statistics , info and look at the first 7 crimes
* plot the connection between crime types and location description

## Data manipulation

In [None]:
titanic.head()

In [None]:
titanic.drop(['Name'], axis=1)
titanic.head()

why did the 'Name' column stay ?

In [None]:
titanic.drop?

In [None]:
titanic.drop(['Name'], axis=1 , inplace=True)
# alternative : titanic = titanic.drop(['Name'],axis=1)
titanic.head()
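
A minimal sketch of the behavior seen above, assuming a two-row hand-made frame (`df` and its values are invented for illustration): `drop` returns a new DataFrame and leaves the original untouched unless the result is assigned back or `inplace=True` is passed.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"Name": ["A", "B"], "Age": [22, 38]})

# drop() returns a new DataFrame; df itself is not modified
result = df.drop(["Name"], axis=1)
cols_after_plain_drop = list(df.columns)

# assigning the result back is what actually removes the column from df
df = df.drop(["Name"], axis=1)
cols_after_reassign = list(df.columns)

print(cols_after_plain_drop, cols_after_reassign)
```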

#### Exercise: Drop SibSp , Parch , Ticket , Fare , Cabin all together

### reminder
##### python function 

def func(a,b):

    return a+b
    
##### dictionary 

dic  = {'a':1 , 'b':2 , 'c':'OrenHazan'}




In [None]:
#option 1 :
def SetGender(sex):
    if sex=='male':
        return 1
    else:
        return 0
    
titanic['Gender1'] = titanic.Sex.apply(SetGender)
titanic.head()

In [None]:
#option 2 :
titanic['Gender2'] = titanic.Sex.map({'male':1 , 'female':0})
titanic.head()
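
The two options agree on clean data but differ on unexpected values. A minimal sketch, assuming a hand-made Series with one value outside the dictionary (`sex` and `set_gender` are illustrative names, not part of the lesson data):

```python
import pandas as pd

# Hypothetical column with one unexpected value
sex = pd.Series(["male", "female", "unknown"])

def set_gender(s):
    # everything that is not 'male' falls into the else branch
    return 1 if s == "male" else 0

via_apply = sex.apply(set_gender)

# map() with a dictionary returns NaN for keys that are not in the dictionary
via_map = sex.map({"male": 1, "female": 0})

print(via_apply.tolist())
print(via_map.tolist())
```

With `map`, a typo or a missing value in the column shows up as NaN instead of being silently encoded as 0.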

### Exercise
* set Gender1 as Gender
* Embarked to Port
* drop Sex , Embarked , Gender1 , Gender2
* drop all missing values rows using dropna()

# Part 2 : ML process

![flow](images\flow.png)

In [None]:
from sklearn.model_selection import train_test_split
train,test = train_test_split(
    titanic,                # The dataset we want to split
    train_size=0.7,    # The proportional size of our training set
    stratify=titanic.Survived, # The labels are used for stratification
    random_state=40   # Use the same random state for reproducibility
)

In [None]:
#separate the target from the dataset
x_train = train.drop(['Survived'],axis=1)
y_train = train.Survived
x_test = test.drop(['Survived'],axis=1)
y_test = test.Survived
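
A minimal sketch of what `stratify` does, assuming a synthetic frame with 30% positive labels (`toy` is invented for illustration): the positive rate stays about the same in both parts of the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 zeros and 30 ones
toy = pd.DataFrame({"feature": range(100), "Survived": [0] * 70 + [1] * 30})

train_part, test_part = train_test_split(
    toy,
    train_size=0.7,
    stratify=toy.Survived,  # keep the label proportions in both parts
    random_state=40,
)

# share of positive labels in each part
train_rate = train_part.Survived.mean()
test_rate = test_part.Survived.mean()
print(train_rate, test_rate)
```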

### General scikit learn modeling :
1. import your model
2. create an instance of your model (you can set up your parameters or use default)
3. train your model using fit() method on your <b>train</b> data
4. apply your model on the <b>test</b> data using predict() method for prediction
5. evaluate the predictions against y_test using your measure of choice (accuracy , MSE , AUC , log loss ...)



In [None]:
#step1
from sklearn.neighbors import KNeighborsClassifier

In [None]:
#step2
knn = KNeighborsClassifier(n_neighbors=3)

In [None]:
#step3
knn.fit(x_train,y_train)

In [None]:
#step4
y_predicted = knn.predict(x_test)

In [None]:
#step5
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_predicted)


![cm](images\cm.jpg)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_predicted)
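
A minimal sketch of how to read the matrix, assuming hand-made labels (`y_true` and `y_hat` are invented for illustration). In scikit-learn the rows are the true classes and the columns are the predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)

# for a binary problem, ravel() gives the four cells in this order
tn, fp, fn, tp = cm.ravel()

# accuracy is the share of correct predictions (the diagonal of the matrix)
accuracy = (tp + tn) / cm.sum()
print(cm)
print(accuracy)
```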

### Exercise : draw the confusion matrix

#### go through the process using a different random seed

![shock](images\wait.gif)

### Cross validation

![cv](images\cv.jpg)

In [None]:
from sklearn.model_selection import cross_val_score
knn5 = KNeighborsClassifier()   # default n_neighbors=5
cross_val_score(knn5,x_train,y_train,n_jobs=-1,scoring='accuracy',cv=5)   #.mean()
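
A minimal sketch of what `cross_val_score` does behind the scenes, assuming synthetic data from `make_classification` instead of the Titanic set: split into 5 folds, train on 4, score on the held-out one, and repeat.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for this illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = KNeighborsClassifier()

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(model)                    # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on 4 folds
    y_val_pred = fold_model.predict(X[val_idx])  # predict the held-out fold
    manual_scores.append(accuracy_score(y[val_idx], y_val_pred))

# for a classifier, cv=5 uses the same stratified folds
library_scores = cross_val_score(model, X, y, scoring="accuracy", cv=5)
print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```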

### Exercise :
* create a function that gets classifier, x_train, y_train, scoring, cv and returns the mean score of cross_val_score
* Create 3 models with default parameters - KNeighborsClassifier , RandomForestClassifier , DecisionTreeClassifier
* apply the function on the 3 models using cv=5 & scoring='accuracy' and print the result
* try using different parameters for the models and see if you can improve your score

## Parameter tuning


![rubik](images\rubik.gif)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
forest = sklearn.ensemble.RandomForestClassifier()  # the model to tune
n_trees = range(10,200)
max_features = ['sqrt','log2']
cores = -1  # use all available CPU cores
param_dict = dict(n_estimators=n_trees, max_features=max_features)
grid = RandomizedSearchCV(forest, param_dict, cv=5, scoring='accuracy', n_jobs=cores)

In [None]:
grid.fit(x_train,y_train)

In [None]:
print (grid.best_score_)
print (grid.best_params_)
print (grid.best_estimator_)
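
`RandomizedSearchCV` does not try every combination; it samples `n_iter` of them (10 by default). A minimal sketch of inspecting all sampled candidates through `cv_results_`, assuming synthetic data and a deliberately small search (`n_iter=4`, `cv=3`) so that it runs quickly:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption for this illustration)
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": range(5, 30), "max_features": ["sqrt", "log2"]},
    n_iter=4,           # number of parameter combinations to sample
    cv=3,
    scoring="accuracy",
    random_state=0,     # makes the sampled combinations reproducible
)
search.fit(X, y)

# one row per sampled combination, with its mean cross-validation score
results = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "param_max_features", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```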

### Finally , test against the test set

In [None]:

# build the final model with the best parameters found by the search
RF = sklearn.ensemble.RandomForestClassifier(
    n_estimators=grid.best_params_["n_estimators"],
    max_features=grid.best_params_["max_features"])
RF.fit(x_train, y_train)
y_pred = RF.predict(x_test)
print('final accuracy score :', accuracy_score(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True)

#### see feature importance

In [None]:
sns.barplot(y=x_test.columns,x=RF.feature_importances_,orient="h" )
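
A minimal sketch of turning the importances into a sorted, labeled table before plotting, assuming synthetic data with made-up column names (`f0` to `f3`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# pair each importance with its column name and sort from most to least important
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances of a random forest sum to 1, so each value can be read as a relative share.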

### Exercise : tune up DecisionTreeClassifier 
##### if time allows...
