## Introduction to Kaggle

- Use pandas.read_csv() to import train.csv and assign it to the variable train.
- Use DataFrame.shape to calculate the number of rows and columns in train, and assign the result to train_shape.
- Click Run to run your code, and use the variable inspector to view the four variables you just created.

In [19]:
import pandas as pd
import matplotlib.pyplot as plt  # plt.show() is provided by pyplot, not the top-level matplotlib package
from sklearn.linear_model import LogisticRegression

test = pd.read_csv("test.csv")
test_shape = test.shape

In [20]:
train = pd.read_csv("train.csv")
train_shape = train.shape

In [21]:
print(test_shape)

(418, 11)


In [22]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

## Exploring the data

- Use DataFrame.pivot_table() to pivot the train dataframe:
    - Use "Pclass" for the index parameter.
    - Use "Survived" for the values parameter.
- Use DataFrame.plot.bar() to plot the pivot table.

In [23]:
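import matplotlib.pyplot as plt  # plt.show() is provided by pyplot, not the top-level package

# Mean survival rate per passenger class, as the instructions ask.
pclass_pivot = train.pivot_table(index="Pclass", values="Survived")
pclass_pivot.plot.bar()
plt.show()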
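
To make clear what the pivot table holds, here is a minimal sketch on a hypothetical five-row dataframe (`toy` is illustrative and not part of train.csv). Because `pivot_table()` aggregates with the mean by default and `Survived` is 0 or 1, each value is the share of survivors in that group.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file: two passenger classes.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0],
})

# The default aggregation is the mean, so each row of the result is the
# fraction of passengers in that class who survived.
toy_pivot = toy.pivot_table(index="Pclass", values="Survived")
print(toy_pivot)
```

In this toy data every first-class passenger survives and one in three third-class passengers does, so the bar plot of such a table compares survival rates directly.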

## Exploring and Converting the Age Column

- Create the cut_points and label_names lists to split the Age column into seven categories:
    - Missing, from -1 to 0
    - Infant, from 0 to 5
    - Child, from 5 to 12
    - Teenager, from 12 to 18
    - Young Adult, from 18 to 35
    - Adult, from 35 to 60
    - Senior, from 60 to 100
- Apply the process_age() function on the train dataframe, assigning the result to train.
- Apply the process_age() function on the test dataframe, assigning the result to test.
- Use DataFrame.pivot_table() to pivot the train dataframe by the Age_categories column.
- Use DataFrame.plot.bar() to plot the pivot table.

In [24]:
def process_age(df,cut_points,label_names):
    df["Age"] = df["Age"].fillna(-0.5)
    df["Age_categories"] = pd.cut(df["Age"],cut_points,labels=label_names)
    return df

In [25]:
cut_points = [-1,0,5,12,18,35,60,100]
label_names = ["Missing","Infant","Child","Teenager","Young Adult","Adult","Senior"]
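
As a quick check of how these boundaries behave, the sketch below runs `pd.cut()` on a few hypothetical ages (the `ages` series is made up for illustration). By default each bin includes its right edge, so an age of exactly 5 falls in Infant and exactly 18 in Teenager.

```python
import pandas as pd

cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", "Infant", "Child", "Teenager",
               "Young Adult", "Adult", "Senior"]

# Hypothetical ages; -0.5 stands in for a missing value, as in process_age().
ages = pd.Series([-0.5, 4, 5, 18, 18.5, 70])

# Bins are (-1, 0], (0, 5], (5, 12], ... because right=True is the default.
categories = pd.cut(ages, cut_points, labels=label_names)
print(categories.tolist())
```

There are eight cut points and seven labels: `pd.cut()` needs one more boundary than it has categories.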

In [26]:
train = process_age(train,cut_points,label_names)
test = process_age(test,cut_points,label_names)

In [29]:
import matplotlib.pyplot as plt  # plt.show() is provided by pyplot, not the top-level package

train_pivot = train.pivot_table(index="Age_categories", values="Survived")
train_pivot.plot.bar()
plt.show()

## Preparing our Data for Machine Learning

- Use the create_dummies() function to create dummy variables for the Sex column:
    - in the train dataframe.
    - in the test dataframe.
- Use the create_dummies() function to create dummy variables for the Age_categories column:
    - in the train dataframe.
    - in the test dataframe.

In [30]:
def create_dummies(df,column_name):
    dummies = pd.get_dummies(df[column_name],prefix=column_name)
    df = pd.concat([df,dummies],axis=1)
    return df

train = create_dummies(train,"Pclass")
test = create_dummies(test,"Pclass")

train = create_dummies(train,"Sex")
test = create_dummies(test,"Sex")

train = create_dummies(train,"Age_categories")
test = create_dummies(test,"Age_categories")
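
To see what `pd.get_dummies()` produces inside `create_dummies()`, here is a small sketch on a hypothetical three-row column (`toy` is illustrative only). Each distinct value becomes its own indicator column, named with the prefix.

```python
import pandas as pd

# Hypothetical toy column, not the Titanic data.
toy = pd.DataFrame({"Sex": ["male", "female", "male"]})

# One indicator column per distinct value, named "<prefix>_<value>".
dummies = pd.get_dummies(toy["Sex"], prefix="Sex")
print(dummies.columns.tolist())
print(dummies.astype(int))
```

Exactly one indicator is set in each row, which is why the model can later treat `Sex_female` and `Sex_male` as separate numeric features.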

In [31]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'Age_categories',
       'Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
       'Age_categories_Missing', 'Age_categories_Infant',
       'Age_categories_Child', 'Age_categories_Teenager',
       'Age_categories_Young Adult', 'Age_categories_Adult',
       'Age_categories_Senior'],
      dtype='object')

## Creating Our First Machine Learning Model

- Instantiate a LogisticRegression object called lr.
- Use LogisticRegression.fit() to fit the model on the train dataset using:
    - The columns contained in columns as the first (X) parameter.
    - The Survived column as the second (y) parameter.


In [33]:
columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
       'Age_categories_Missing','Age_categories_Infant',
       'Age_categories_Child', 'Age_categories_Teenager',
       'Age_categories_Young Adult', 'Age_categories_Adult',
       'Age_categories_Senior']

from sklearn.linear_model import LogisticRegression

In [34]:
lr=LogisticRegression()
lr.fit(train[columns],train["Survived"])


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## Splitting our training data

- Use the model_selection.train_test_split() function to split the train dataframe using the following parameters:
    - test_size of 0.2.
    - random_state of 0.
- Assign the four returned objects to train_X, test_X, train_y, and test_y.

In [39]:
holdout = test # from now on we will refer to this
               # dataframe as the holdout data

from sklearn.model_selection import train_test_split

columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
       'Age_categories_Missing','Age_categories_Infant',
       'Age_categories_Child', 'Age_categories_Teenager',
       'Age_categories_Young Adult', 'Age_categories_Adult',
       'Age_categories_Senior']

x=train[columns]
y=train["Survived"]
train_X,test_X,train_y,test_y=train_test_split(x,y,test_size=0.2,random_state=0)

In [40]:
test_y.value_counts()

0    110
1     69
Name: Survived, dtype: int64
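
The split sizes can be checked on a small stand-in dataset. This sketch uses hypothetical arrays (`toy_X`, `toy_y`) with the same `test_size` and `random_state` as above; with 10 rows, 20% means 2 rows are held out and 8 remain for training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 features, alternating labels.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# Four objects come back in this order: train features, test features,
# train labels, test labels.
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)

print(toy_train_X.shape, toy_test_X.shape)
```

The same logic explains the counts above: the 110 + 69 = 179 rows in test_y are the held-out 20% of train.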

## Making Predictions and Measuring their Accuracy

- Instantiate a new LogisticRegression() object, lr.
- Fit the model using train_X and train_y.
- Make predictions using test_X and assign the results to predictions.
- Use accuracy_score() to compare test_y and predictions, assigning the result to accuracy.
- Print the accuracy variable.

In [41]:
lr=LogisticRegression()
lr.fit(train_X,train_y)
prediction=lr.predict(test_X)

In [42]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_y,prediction)

array([[96, 14],
       [20, 49]])

In [43]:
from sklearn.metrics import accuracy_score
accuracy_score(test_y,prediction)

0.8100558659217877
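
The accuracy can be recovered by hand from the confusion matrix printed earlier: correct predictions sit on the diagonal, and accuracy is their share of all rows. A short sketch using the numbers from that output:

```python
import numpy as np

# Confusion matrix copied from the confusion_matrix() output in this section:
# rows are the true labels (0, 1), columns are the predicted labels (0, 1).
cm = np.array([[96, 14],
               [20, 49]])

correct = np.trace(cm)  # 96 + 49 predictions on the diagonal are correct
total = cm.sum()        # 179 rows in the test split
manual_accuracy = correct / total
print(manual_accuracy)
```

So 145 of the 179 held-out passengers are classified correctly, which is the same figure accuracy_score() reports.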

In [44]:
test_X.columns

Index(['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
       'Age_categories_Missing', 'Age_categories_Infant',
       'Age_categories_Child', 'Age_categories_Teenager',
       'Age_categories_Young Adult', 'Age_categories_Adult',
       'Age_categories_Senior'],
      dtype='object')

In [45]:
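import numpy as np
# One-row feature vectors in the same column order as test_X.columns:
# Pclass_1..3, Sex_female, Sex_male, then the seven Age_categories_* dummies.
jc = np.array([0,0,1,0,1,0,0,0,0,1,0,0]).reshape(1,-1)  # 3rd class, male, Young Adult
js = np.array([0,1,0,0,1,0,0,0,0,0,1,0]).reshape(1,-1)  # 2nd class, male, Adult
kf = np.array([1,0,0,0,1,0,0,0,0,1,0,0]).reshape(1,-1)  # 1st class, male, Young Adult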

In [64]:
print(jc)

[[0 0 1 0 1 0 0 0 0 1 0 0]]


In [46]:
print(lr.predict_proba(jc))

[[0.88644919 0.11355081]]
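
Each row returned by `predict_proba()` holds one probability per class, in the order given by the model's `classes_` attribute, so the output above reads as roughly 0.89 for class 0 (did not survive) and 0.11 for class 1 (survived). The sketch below shows the same layout on hypothetical toy data (`toy_X`, `toy_y`, and `toy_lr` are illustrative names).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one binary feature; most rows with feature 1 have label 1.
toy_X = np.array([[0], [0], [0], [1], [1], [1]])
toy_y = np.array([0, 0, 1, 1, 1, 0])

toy_lr = LogisticRegression()
toy_lr.fit(toy_X, toy_y)

# One row in, one row of class probabilities out.
proba = toy_lr.predict_proba(np.array([[1]]))
print(toy_lr.classes_)  # column order of predict_proba
print(proba)
```

The probabilities in each row sum to 1, and `predict()` returns the class with the larger value.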


In [47]:
from sklearn.metrics import accuracy_score

lr = LogisticRegression()
lr.fit(train_X, train_y)
predictions = lr.predict(test_X)

In [49]:
accuracy=accuracy_score(test_y,predictions)
print(accuracy)

0.8100558659217877


## Using Cross Validation for More Accurate Error Measurement

- Instantiate a new LogisticRegression() object, lr.
- Use model_selection.cross_val_score() to perform cross-validation on our data and assign the results to scores:
    -  Use the newly created lr as the estimator.
    - Use all_X and all_y as the input data.
    - Specify 10 folds to be used.
- Use the numpy.mean() function to calculate the mean of scores and assign the result to accuracy.
- Print the variables scores and accuracy.

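all_X and all_y are not defined earlier in this notebook. They refer to all of the training data, so define them from train before running cross-validation.

In [50]:
from sklearn.model_selection import cross_val_score
import numpy as np

# All of the training data: feature columns and target.
all_X = train[columns]
all_y = train["Survived"]

lr = LogisticRegression()
scores = cross_val_score(lr, all_X, all_y, cv=10)
accuracy = np.mean(scores)
print(scores)
print(accuracy)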
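
For reference, this is what `cross_val_score()` returns, sketched on synthetic data rather than the Titanic columns (`toy_X` and `toy_y` come from `make_classification` and are assumptions for illustration only). With `cv=10` the estimator is fitted ten times, each time scored on a different held-out fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for all_X and all_y.
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

# One accuracy score per fold; the estimator is cloned and refitted for each.
toy_scores = cross_val_score(LogisticRegression(), toy_X, toy_y, cv=10)
toy_accuracy = np.mean(toy_scores)

print(len(toy_scores))
print(toy_accuracy)
```

Averaging over the folds gives a steadier estimate than a single train/test split, since no one split decides the result.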

## Making Predictions on Unseen Data

- Instantiate a new LogisticRegression() object, lr.
- Use the fit() method to train the model lr using all of the Kaggle training data: all_X and all_y.
- Make predictions using the holdout data and assign the result to holdout_predictions

In [51]:
columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
       'Age_categories_Missing','Age_categories_Infant',
       'Age_categories_Child', 'Age_categories_Teenager',
       'Age_categories_Young Adult', 'Age_categories_Adult',
       'Age_categories_Senior']

In [52]:
all_X = train[columns]     # all of the Kaggle training data (features)
all_y = train["Survived"]  # all of the Kaggle training data (target)

lr = LogisticRegression()
lr.fit(all_X, all_y)
holdout_predictions = lr.predict(holdout[columns])

## Creating a Submission File

- Create a dataframe submission that matches Kaggle's specification.
- Use the to_csv() method to save the submission dataframe using the filename submission.csv, using the documentation to look up the correct syntax.


In [None]:
holdout_ids = holdout["PassengerId"]
submission_df = {"PassengerId": holdout_ids,
                 "Survived": holdout_predictions}
submission = pd.DataFrame(submission_df)

submission.to_csv("submission.csv",index=False)
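
To preview what the saved file looks like without touching disk, here is a sketch that writes a hypothetical three-row submission to an in-memory buffer (the ids and predictions are made up; in the notebook they come from holdout_ids and holdout_predictions).

```python
import io
import pandas as pd

# Hypothetical ids and predictions standing in for the real holdout values.
toy_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# index=False leaves out the dataframe's row index, so the file contains
# only the two named columns.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Without `index=False`, pandas would write an extra unnamed first column, which would not match the two-column layout built above.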