# Decision Trees Exercise

### First Step: Download the __[Titanic Data](https://www.kaggle.com/c/titanic/data)__
- **NOTE**: use only the file `train.csv` from the dataset.
#### 1. Import the pandas and numpy libraries

In [34]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report, precision_recall_curve, roc_curve, roc_auc_score
#import scikitplot as skplt
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### 2. Load Data using Pandas

In [2]:
df = pd.read_csv('train.csv')

#### 3. Check the columns in the dataset and drop useless columns

- **Hint**: the useless columns are `Name`, `Ticket`, `PassengerId`, and `Cabin`.

In [3]:
df_copy = df.copy()

In [4]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [9]:
# Names of the columns to drop (a plain list of labels is all that drop() needs)
columns = ['Name', 'Ticket', 'PassengerId', 'Cabin']

In [10]:
df_cleaned = df_copy.drop(columns, axis = 1)

In [12]:
df_cleaned.shape

(891, 8)

In [13]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


#### 4. Check for null values in each column and fill them with the mode

- **Hint**:
  1. For `Age`, use the mode since there are outliers in this column.
  2. For `Embarked`, use the mode since it is a categorical variable.


In [15]:
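# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas versions and may not modify df_cleaned
df_cleaned['Age'] = df_cleaned['Age'].fillna(df_cleaned['Age'].mode()[0])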

In [17]:
# Assign the result back instead of relying on inplace=True on a selected column
df_cleaned['Embarked'] = df_cleaned['Embarked'].fillna(df_cleaned['Embarked'].mode()[0])

#### 5. Handle categorical data using `get_dummies()` in pandas

- **Hint**: handle only the columns `Sex` and `Embarked`.
- Read this document on how to use [`get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

In [19]:
df_cleaned['Embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

In [20]:
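# dtype=int keeps the 0/1 output shown below (newer pandas versions default to True/False)
emb_categ = pd.get_dummies(df_cleaned['Embarked'], dtype=int)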

In [21]:
emb_categ

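| | C | Q | S |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| ... | ... | ... | ... |
| 886 | 0 | 0 | 1 |
| 887 | 0 | 0 | 1 |
| 888 | 0 | 0 | 1 |
| 889 | 1 | 0 | 0 |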


In [25]:
# merge dummies to original data frame
df_testing = pd.concat([df_cleaned,emb_categ], axis =1)
df_testing.head()

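| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | C | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | 0 | 0 | 1 |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | 1 | 0 | 0 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | 0 | 0 | 1 |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | 0 | 0 | 1 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | 0 | 0 | 1 |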


In [43]:
# Convert the Sex column into a binary variable where male = 1 and female = 0
df_testing["Sex"] = np.where(df_testing["Sex"] == "female", 0, 1)

In [44]:
df_testing['Sex'].unique()

array([1, 0])

#### 6. Separate X (features) from y (labels)
**Hint**:
- Goal: predict whether a passenger survived or not.

In [45]:
df_testing.describe()

| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | C | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 0.647587 | 28.56697 | 0.523008 | 0.381594 | 32.204208 | 0.188552 | 0.08642 | 0.725028 |
| std | 0.486592 | 0.836071 | 0.47799 | 13.199572 | 1.102743 | 0.806057 | 49.693429 | 0.391372 | 0.281141 | 0.446751 |
| min | 0.0 | 1.0 | 0.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 0.0 | 22.0 | 0.0 | 0.0 | 7.9104 | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 3.0 | 1.0 | 24.0 | 0.0 | 0.0 | 14.4542 | 0.0 | 0.0 | 1.0 |
| 75% | 1.0 | 3.0 | 1.0 | 35.0 | 1.0 | 0.0 | 31.0 | 0.0 | 0.0 | 1.0 |
| max | 1.0 | 3.0 | 1.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 1.0 | 1.0 | 1.0 |


In [46]:
X = df_testing.iloc[:,1:]
# Select the label as a 1-D Series, which is the shape scikit-learn estimators expect
y = df_testing['Survived']


#### 7. Split the data into training and test sets with `random_state=5` and `test_size=0.25`

In [47]:
# test_size matches the 0.25 requested in the task statement
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)


#### 8. Scale all Data using `StandardScaler` 

In [48]:
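# 'Embarked' is still a string column and cannot be scaled; its information
# is already encoded in the dummy columns C, Q, and S, so drop it first
X_train = X_train.drop(columns=['Embarked'])
X_test = X_test.drop(columns=['Embarked'])

# Perform pre-processing to scale numeric features
scale = preprocessing.StandardScaler()
X_train = scale.fit_transform(X_train)

# Test features are scaled using the scaler computed for the training features
X_test = scale.transform(X_test)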

#### 9. Build your model (Decision Tree)
 Use the default sklearn parameters with `random_state=1`.
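
One possible way to approach this step is sketched below. The helper name `fit_decision_tree` is hypothetical (it is not part of the exercise), and the sketch assumes the training features are already numeric.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_decision_tree(X_train, y_train, random_state=1):
    """Fit a decision tree with scikit-learn's default parameters.

    `fit_decision_tree` is a hypothetical helper name used for illustration.
    """
    model = DecisionTreeClassifier(random_state=random_state)
    # np.ravel flattens a single-column DataFrame into the 1-D shape fit() expects
    model.fit(X_train, np.ravel(y_train))
    return model
```

With the variables from step 8, this could be called as `dt_model = fit_decision_tree(X_train, y_train)`.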

#### 10. Use decision tree pruning to determine the best maximum depth on the test data
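
A minimal sketch of this step, assuming that "pruning" here means limiting `max_depth` (pre-pruning) and that candidate depths 1 to 20 are a reasonable search range. Both are assumptions, and `best_max_depth` is a hypothetical helper name. Following the task wording, the depth is picked on the test data; in practice a validation set or cross-validation would usually be preferred.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def best_max_depth(X_train, y_train, X_test, y_test, depths=range(1, 21)):
    """Return the max_depth with the highest test accuracy, plus all scores.

    The depth range 1..20 is an assumed search space, not given by the exercise.
    """
    scores = {}
    for depth in depths:
        tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
        tree.fit(X_train, np.ravel(y_train))
        scores[depth] = tree.score(X_test, np.ravel(y_test))
    # max() returns the first key with the top score, i.e. the shallowest tree
    best = max(scores, key=scores.get)
    return best, scores
```

A call such as `depth, scores = best_max_depth(X_train, y_train, X_test, y_test)` returns the chosen depth together with the accuracy for every candidate, which can then be plotted.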

#### 11. Build your model (Random Forest)
 - Use the parameters `oob_score=True`, `random_state=1`, `warm_start=True`, and `n_jobs=-1`.
 - Use a number of trees in the range 200 to 300.
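
The parameters above could be combined as in the following sketch. The step of 10 trees between 200 and 300 and the helper name `oob_error_by_n_trees` are assumptions made for illustration. Because `warm_start=True`, each call to `fit` only adds the missing trees to the existing forest instead of retraining from scratch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def oob_error_by_n_trees(X_train, y_train, n_trees=range(200, 301, 10)):
    """Grow one forest incrementally and record the out-of-bag error per size.

    The step of 10 trees is an assumption; the exercise only gives the range.
    """
    rf = RandomForestClassifier(
        oob_score=True, random_state=1, warm_start=True, n_jobs=-1
    )
    y_flat = np.ravel(y_train)
    oob_errors = {}
    for n in n_trees:
        rf.set_params(n_estimators=n)  # warm_start keeps the trees already fitted
        rf.fit(X_train, y_flat)
        oob_errors[n] = 1 - rf.oob_score_
    return rf, oob_errors
```

The returned forest holds the largest number of trees tried, and the forest size with the lowest out-of-bag error can be read from the dictionary with `min(oob_errors, key=oob_errors.get)`.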

#### 12. Calculate the confusion matrix, precision, recall, and F1-score for the Decision Tree and Random Forest models
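
These metrics can be gathered with a small helper like the one sketched here. The name `evaluate` and the dictionary layout are hypothetical; the sketch assumes a fitted binary classifier whose positive class is `1` (survived).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score


def evaluate(model, X_test, y_test):
    """Compute the confusion matrix, precision, recall, and F1 for a fitted model.

    `evaluate` is a hypothetical helper; class 1 is treated as the positive class.
    """
    y_true = np.ravel(y_test)
    y_pred = model.predict(X_test)
    return {
        "confusion_matrix": confusion_matrix(y_true, y_pred),  # rows: true, cols: predicted
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```

Calling it once per fitted model (for example with the decision tree and then with the random forest) gives results that can be compared side by side.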

#### 13. Build your model (SVM)
 Use the default sklearn parameters with `random_state=1`.
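
A possible sketch for this step, assuming that "SVM" refers to scikit-learn's `SVC` classifier; the helper name `fit_svm` is hypothetical. Support vector machines are sensitive to feature scale, so the scaled features from step 8 are the natural input.

```python
import numpy as np
from sklearn.svm import SVC


def fit_svm(X_train, y_train, random_state=1):
    """Fit a support vector classifier with scikit-learn's default parameters.

    `fit_svm` is a hypothetical helper; SVC is assumed to be the intended model.
    """
    model = SVC(random_state=random_state)
    # np.ravel flattens a single-column DataFrame into the 1-D shape fit() expects
    model.fit(X_train, np.ravel(y_train))
    return model
```

With the scaled data, this could be used as `svm_model = fit_svm(X_train, y_train)`.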

#### 14. Calculate the confusion matrix, precision, recall, and F1-score for the SVM model
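
One compact option is scikit-learn's `classification_report`, which prints precision, recall, and F1-score per class in a single table. In the sketch below, the helper name `summarize` and the class names "Not survived" and "Survived" are assumptions chosen for readability.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix


def summarize(model, X_test, y_test, class_names=("Not survived", "Survived")):
    """Return the confusion matrix and a text report for a fitted binary model.

    `summarize` and the class names are hypothetical, chosen for illustration.
    """
    y_true = np.ravel(y_test)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    report = classification_report(
        y_true, y_pred, labels=[0, 1], target_names=list(class_names), zero_division=0
    )
    return cm, report
```

For a fitted SVM stored in a variable such as `svm_model`, `cm, report = summarize(svm_model, X_test, y_test)` followed by `print(report)` shows all three scores at once.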