![](https://upload.wikimedia.org/wikipedia/en/b/bb/Titanic_breaks_in_half.jpg)

# Project 1: [Titanic](https://www.kaggle.com/c/titanic/data)
---

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this project, you will exercise your skills with loading data, python data structures, and Pandas to identify charactaristics of Titanic survivors!

---
#### Your goals should be to:
* Practice python programming including loops, conditionals, types, functions, and data structures
* Start thinking critically about manipulating, organizing, and interpreting data
* Troubleshoot errors

---
#### Getting Started:
* **fork** the repository on git.generalassemb.ly
* **clone** your forked repo

---
#### Submission:
* You should be working on a **fork** of the GA project one repository. 
* Use **git** to manage versions of your project. Make sure to `add`, `commit`, and `push` your changes to **your fork** of the github 
* Submit a link to your project repository in the submission form by **Friday, 9/29 11:59 PM**. You will then receive the solutions.
* Create a copy of your original notebook (file > make a copy in jupyter notebook)
* In the copy, use the solutions to correct your work. Make sure to take note of your successes and struggles. Did you learn anything new from correcting your work?
* Submit the corrected version by **Sunday, 10/1 11:59 PM** to receive instructor feedback on your work. ***Projects submitted after this deadline will not receive instructor feedback.***

### Considerations:

* You will be generating long data strutures- avoid displaying the whole thing. Display just the first or last few entries and look at the length or shape to check whether your code gives you back what you want and expect.
* Make functions whenever possiblle!
* Be explicit with your naming. You may forget what `this_list` is, but you will have an idea of what `passenger_fare_list` is. Variable naming will help you in the long run!
* Don't forget about tab autocomplete!
* Use markdown cells to document your planning, thoughts, and results. 
* Delete cells you will not include in your final submission
* Try to solve your own problems using this framework:
  1. Check your spelling
  2. Google your errors. Is it on stackoverflow?
  3. Ask your classmates
  4. Ask a TA or instructor
* Do not include errors or stack traces (fix them!)

# 1. Using the `with open()` method in the `csv` library, load the titanic dataset into a list of lists.

* The `type()` of your dataset should be `list`
* The `type()` of each element in your dataset should also be `list`
* The `len()` of your dataset should be 892 (892 rows, including the header)
* The `len()` of each row element in your dataset should be have a `len()` of 12
* Print out the first 3 rows including the header to check your data.

In [1]:
import csv
from IPython.display import display
import numpy as np

In [2]:

with open('titanic.csv') as f:
    raw_pd = csv.reader(f)
    data = []
    
    for row in raw_pd:
        data.append(row)
        
print(data)

[['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], ['1', '0', '3', 'Braund, Mr. Owen Harris', 'male', '22', '1', '0', 'A/5 21171', '7.25', '', 'S'], ['2', '1', '1', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'female', '38', '1', '0', 'PC 17599', '71.2833', 'C85', 'C'], ['3', '1', '3', 'Heikkinen, Miss. Laina', 'female', '26', '0', '0', 'STON/O2. 3101282', '7.925', '', 'S'], ['4', '1', '1', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'female', '35', '1', '0', '113803', '53.1', 'C123', 'S'], ['5', '0', '3', 'Allen, Mr. William Henry', 'male', '35', '0', '0', '373450', '8.05', '', 'S'], ['6', '0', '3', 'Moran, Mr. James', 'male', '', '0', '0', '330877', '8.4583', '', 'Q'], ['7', '0', '1', 'McCarthy, Mr. Timothy J', 'male', '54', '0', '0', '17463', '51.8625', 'E46', 'S'], ['8', '0', '3', 'Palsson, Master. Gosta Leonard', 'male', '2', '3', '1', '349909', '21.075', '', 'S'], ['9', '1', '3', 'Johnson, Mrs. Osc

# 2. Separate the first header row from the rest of your dataset. 

* The header should be a list of the column names
* The data should be the rest of your data
* Display the header and the first row of the dataset zipped together using `zip`
* Your result should look like...


```
[('PassengerId', '1'),
 ('Survived', '0'),
 ('Pclass', '3'),
 ...
 ('Embarked', 'S')]
 ```

In [3]:
header = data[0]
new_pd = []
data.remove(data[0])
#data = new_pd[1:]
print(header)
print(data[0])

['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
['1', '0', '3', 'Braund, Mr. Owen Harris', 'male', '22', '1', '0', 'A/5 21171', '7.25', '', 'S']


In [4]:
print(list(zip(header,data[1])))


[('PassengerId', '2'), ('Survived', '1'), ('Pclass', '1'), ('Name', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'), ('Sex', 'female'), ('Age', '38'), ('SibSp', '1'), ('Parch', '0'), ('Ticket', 'PC 17599'), ('Fare', '71.2833'), ('Cabin', 'C85'), ('Embarked', 'C')]


# 3. Using a `for` loop, load your data into a `dict` called `data_dict`.

* The keys of your `data_dict` should be `PassengerId`
* The values of your `data_dict` should be dictionaries...
  * Each of these dictionaries should reperesent a column value within a row
  * The keys should be the names of the columns
  * The values should be the values of that column
  
The beginning of your `data_dict` should look like: 

    {'1': {'Age': '22',
      'Cabin': '',
      'Embarked': 'S',
      'Fare': '7.25',
      'Name': 'Braund, Mr. Owen Harris',
      'Parch': '0',
      'Pclass': '3',
      'Sex': 'male',
      'SibSp': '1',
      'Survived': '0',
      'Ticket': 'A/5 21171'},
     '10': {'Age': '14',
      'Cabin': '',
      'Embarked': 'C',
      'Fare': '30.0708',
      'Name': 'Nasser, Mrs. Nicholas (Adele Achem)',
      'Parch': '0',
      'Pclass': '2',
      'Sex': 'female',
      'SibSp': '1',
      'Survived': '1',
      'Ticket': '237736'},
      ...
      }

In [5]:
len(header)

12

In [6]:
len(data[10])

12

In [7]:
list(zip(header, data[10]))

[('PassengerId', '11'),
 ('Survived', '1'),
 ('Pclass', '3'),
 ('Name', 'Sandstrom, Miss. Marguerite Rut'),
 ('Sex', 'female'),
 ('Age', '4'),
 ('SibSp', '1'),
 ('Parch', '1'),
 ('Ticket', 'PP 9549'),
 ('Fare', '16.7'),
 ('Cabin', 'G6'),
 ('Embarked', 'S')]

In [8]:
data_dict = {}


for row in data:
    new = {i:j for i,j in zip(header, row)}
    data_dict[new['PassengerId']] = new
    
    
data_dict
    



{'1': {'Age': '22',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '7.25',
  'Name': 'Braund, Mr. Owen Harris',
  'Parch': '0',
  'PassengerId': '1',
  'Pclass': '3',
  'Sex': 'male',
  'SibSp': '1',
  'Survived': '0',
  'Ticket': 'A/5 21171'},
 '10': {'Age': '14',
  'Cabin': '',
  'Embarked': 'C',
  'Fare': '30.0708',
  'Name': 'Nasser, Mrs. Nicholas (Adele Achem)',
  'Parch': '0',
  'PassengerId': '10',
  'Pclass': '2',
  'Sex': 'female',
  'SibSp': '1',
  'Survived': '1',
  'Ticket': '237736'},
 '100': {'Age': '34',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '26',
  'Name': 'Kantor, Mr. Sinai',
  'Parch': '0',
  'PassengerId': '100',
  'Pclass': '2',
  'Sex': 'male',
  'SibSp': '1',
  'Survived': '0',
  'Ticket': '244367'},
 '101': {'Age': '28',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '7.8958',
  'Name': 'Petranec, Miss. Matilda',
  'Parch': '0',
  'PassengerId': '101',
  'Pclass': '3',
  'Sex': 'female',
  'SibSp': '0',
  'Survived': '0',
  'Ticket': '349245'},
 '102': {'Age': ''

# 4. Repeat step 3 using a dictionary comprehension.

* Using `==`, check if your `data_dict` from your `for` loop is the same as the one from your dictionary comprehension.

In [9]:
# Your code here
data_dict_comp = {}

data_dict_comp = {
    col_name:{data[col_name] for ID, data in data_dict_comp.items()} for col_name in header
}

print(data_dict_comp)
    
if data_dict == data_dict_comp:
    print('yep')

{'PassengerId': set(), 'Survived': set(), 'Pclass': set(), 'Name': set(), 'Sex': set(), 'Age': set(), 'SibSp': set(), 'Parch': set(), 'Ticket': set(), 'Fare': set(), 'Cabin': set(), 'Embarked': set()}


# 5. Transform your `data_dict` to be oriented by column and call it `data_dict_columns`

* Currently, our `data_dict` is oriented by row, indexed by `"PassengerId"`. 
* Transform your data so that the title of each row is a key, the values are of type `list` and represent column vectors.

If you display `data_dict_columns`, the beginning should look like...

    {'Age': ['25',
      '36',
      '24',
      '40',
      '45',
      '2',
      '24',
      '28',
      '33',
      '26',
      '39',
      ...

In [10]:
data_dict

{'1': {'Age': '22',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '7.25',
  'Name': 'Braund, Mr. Owen Harris',
  'Parch': '0',
  'PassengerId': '1',
  'Pclass': '3',
  'Sex': 'male',
  'SibSp': '1',
  'Survived': '0',
  'Ticket': 'A/5 21171'},
 '10': {'Age': '14',
  'Cabin': '',
  'Embarked': 'C',
  'Fare': '30.0708',
  'Name': 'Nasser, Mrs. Nicholas (Adele Achem)',
  'Parch': '0',
  'PassengerId': '10',
  'Pclass': '2',
  'Sex': 'female',
  'SibSp': '1',
  'Survived': '1',
  'Ticket': '237736'},
 '100': {'Age': '34',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '26',
  'Name': 'Kantor, Mr. Sinai',
  'Parch': '0',
  'PassengerId': '100',
  'Pclass': '2',
  'Sex': 'male',
  'SibSp': '1',
  'Survived': '0',
  'Ticket': '244367'},
 '101': {'Age': '28',
  'Cabin': '',
  'Embarked': 'S',
  'Fare': '7.8958',
  'Name': 'Petranec, Miss. Matilda',
  'Parch': '0',
  'PassengerId': '101',
  'Pclass': '3',
  'Sex': 'female',
  'SibSp': '0',
  'Survived': '0',
  'Ticket': '349245'},
 '102': {'Age': ''

In [11]:
data_dict_columns = {}

for col_name in header:
    data_dict_columns[col_name] = []
    
    
    
#print(data_dict_columns)

for ID, SinglePassenger in data_dict.items():
   # print(SinglePassenger)
    for field, value in SinglePassenger.items():
        data_dict_columns[field].append(value)
          
        
print(data_dict_columns["Name"])

['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry', 'Moran, Mr. James', 'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard', 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)', 'Nasser, Mrs. Nicholas (Adele Achem)', 'Sandstrom, Miss. Marguerite Rut', 'Bonnell, Miss. Elizabeth', 'Saundercock, Mr. William Henry', 'Andersson, Mr. Anders Johan', 'Vestrom, Miss. Hulda Amanda Adolfina', 'Hewlett, Mrs. (Mary D Kingcome) ', 'Rice, Master. Eugene', 'Williams, Mr. Charles Eugene', 'Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)', 'Masselmani, Mrs. Fatima', 'Fynney, Mr. Joseph J', 'Beesley, Mr. Lawrence', 'McGowan, Miss. Anna "Annie"', 'Sloper, Mr. William Thompson', 'Palsson, Miss. Torborg Danira', 'Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)', 'Emir, Mr. Farred Chehab', 'Fortune, Mr. Charles Alexander', 'O\'Dwyer, Miss. Ellen "Nelli

# 6. Data Types

What is the current `type` of each column? What do you think the data type of each column *should* be? The data types in Python are...

* `int`
* `float`
* `str`
* `bool`
* `tuple`
* `list`
* `dict`
* `set`

In a markdown cell, describe what each column represents and what the `type` of each value should be. **Extra:** If you want to be fancy, use a [markdown table](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#tables) to display your results.

| Column        | Type          | Desired Type  |
| ------------- |:-------------:| -----:        |
|Passenger      |     string    |   Int         |
|Survived       |     string    |   Int     |
| Pclass        |     string    |   int         |
|Name           |     string    |   string      |
|Sex            |     string    |   string      |
|Age            |     string    |   int         |
|SibSp          |     string    |   int         |
|Parch          |     string    |   Int         |
|Ticket         |     string    |   string      |
|Fare           |     string    |   Float       |
|Cabin          |     string    |   string      |
|Embarked       |     string    |   string      |


# 7. Transform each column to the appropriate type if needed.

Build a function called `transform_column` that takes arguments for a `data_dict`, `column_name`, and `datatype`, and use it to transofm the columns that need transformation.

**NOTE:** There are values in this dataset that cannot be directly cast to a numerical value. Use `if/then` or `try/except` statements to handle errors. 

**To help identify potential sources of errors, explore the `set` of values in each column.**

In [12]:
float(data_dict_columns["PassengerId"][1])

2.0

In [13]:
# Your code here
def transform_col(data_dict, column_name, data_type):
   
    if data_type == 'int':
        for i,index in enumerate(data_dict[column_name]):
            if data_dict[column_name][i].isdigit():
                data_dict[column_name][i] = int(data_dict[column_name][i])
            else:
                data_dict[column_name][i] = 0
                
    if data_type == 'float':
        for i,index in enumerate(data_dict[column_name]):
            if data_dict[column_name][i].isdigit():
                data_dict[column_name][i] = float(data_dict[column_name][i])
            else:
                data_dict[column_name][i] = 0
    
    if data_type == 'bool':
        for i,index in enumerate(data_dict[column_name]):
            if data_dict[column_name][i] == 0 :
                data_dict[column_name][i] = bool(data_dict[column_name][i])
            else:
                bool(data_dict[column_name][i])
            
    return data_dict[column_name]
            
              



transform_col(data_dict_columns, 'PassengerId','int') 
#print(type(data_dict_columns["PassengerId"][1]))
transform_col(data_dict_columns, 'Survived','int')
transform_col(data_dict_columns, 'Pclass','int')
transform_col(data_dict_columns, 'Age','int')
transform_col(data_dict_columns, 'SibSp','int')
transform_col(data_dict_columns, 'Parch','int')
transform_col(data_dict_columns, 'Fare','int')











[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 16,
 0,
 13,
 18,
 0,
 26,
 13,
 0,
 0,
 0,
 0,
 0,
 263,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 52,
 0,
 0,
 18,
 0,
 0,
 21,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 26,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 80,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 29,
 0,
 9,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 263,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 23,
 26,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 52,
 0,
 0,
 0,
 0,
 0,
 0,
 21,
 0,
 0,
 0,
 0,
 0,
 13,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 26,
 13,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 26,
 13,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 55,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 13,
 0,
 0,
 0,
 0,
 39,
 0,
 50,
 0,
 0,
 0,
 0,
 13,
 13,
 0,
 26,
 0,
 0,
 0,
 0,
 0,
 13,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 31,
 0,
 21,
 0,
 13,
 0,
 0,
 0,
 27,
 0,
 0,
 0,
 13,
 0,
 0,
 90,
 0,
 0,
 0,
 13,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 26,
 0,
 0,
 0,


In [30]:
data_dict_columns

{'Age': [22,
  38,
  26,
  35,
  35,
  0,
  54,
  2,
  27,
  14,
  4,
  58,
  20,
  39,
  14,
  55,
  2,
  0,
  31,
  0,
  35,
  34,
  15,
  28,
  8,
  38,
  0,
  19,
  0,
  0,
  40,
  0,
  0,
  66,
  28,
  42,
  0,
  21,
  18,
  14,
  40,
  27,
  0,
  3,
  19,
  0,
  0,
  0,
  0,
  18,
  7,
  21,
  49,
  29,
  65,
  0,
  21,
  0,
  5,
  11,
  22,
  38,
  45,
  4,
  0,
  0,
  29,
  19,
  17,
  26,
  32,
  16,
  21,
  26,
  32,
  25,
  0,
  0,
  0,
  30,
  22,
  29,
  0,
  28,
  17,
  33,
  16,
  0,
  23,
  24,
  29,
  20,
  46,
  26,
  59,
  0,
  71,
  23,
  34,
  34,
  28,
  0,
  21,
  33,
  37,
  28,
  21,
  0,
  38,
  0,
  47,
  0,
  22,
  20,
  17,
  21,
  0,
  29,
  24,
  2,
  21,
  0,
  0,
  0,
  54,
  12,
  0,
  24,
  0,
  45,
  33,
  20,
  47,
  29,
  25,
  23,
  19,
  37,
  16,
  24,
  0,
  22,
  24,
  19,
  18,
  19,
  27,
  9,
  0,
  42,
  51,
  22,
  0,
  0,
  0,
  51,
  16,
  30,
  0,
  0,
  44,
  40,
  26,
  17,
  1,
  9,
  0,
  45,
  0,
  28,
  61,
  4,
  1,
  21,
  56,


# 8. Build functions to calculate the mean, sample standard deviation, and median of a list of ints or floats. Use `scipy.stats.mode` or build your own mode function!


If you filled any missing values with `np.NaN`, you may need to handle that in your functions (look up `np.isnan()`).

If building a `mode` function is too difficult, you import mode from `scipy.stats` using `from scipy.stats import mode`.

**Optional:**  Build a function for calculating the Mode that returns the mode value *and* the count of that value. Mode is tricky, so start by building a function that counts the occurances of each value. You may also need to sort using a `key` with a `lambda function` inside. You may also find a `defaultdict` useful.

Mean

In [31]:
def this_mean(data_list):
    m =0
    for i in data_list:
        if data_list[i] != 0:
            m += i
        else:
            m+=0
    return m/len(data_list)
            

Standard Deviation

In [32]:
def this_std(data_list):
    return np.std(data_list)
    

Median

In [33]:
def this_median(data_list):
    return np.median(data_list)

Mode

In [34]:
def this_mode(data_list)
dict_val = {}
    for i in data_list:
        
    
    return

SyntaxError: invalid syntax (<ipython-input-34-6c43dc89d220>, line 1)

# 9. Summary Statistics of Numerical Columns

For numerical columns, what is the mean, standard deviation, mean, and mode for that data? Which measure of central tendency is the most descriptive of each column? Why? Explain your answer in a markdown cell.

In [35]:
print(this_mean(data_dict_columns['Age']))
print(this_mean(data_dict_columns['Fare']))
print(this_mean(data_dict_columns['SibSp']))
print(this_mean(data_dict_columns['Pclass']))
print(this_mean(data_dict_columns['Survived']))
print(this_median(data_dict_columns['Age']))
print(this_median(data_dict_columns['Fare']))
print(this_median(data_dict_columns['SibSp']))
print(this_median(data_dict_columns['Pclass']))
print(this_std(data_dict_columns['Age']))
print(this_std(data_dict_columns['Fare']))
print(this_std(data_dict_columns['SibSp']))
print(this_std(data_dict_columns['Pclass']))


15.241301907968575
0.4983164983164983
0.2884399551066218
2.308641975308642
0.3838383838383838
24.0
0.0
0.0
3.0
17.733409971
22.6557937084
1.10212443509
0.83560193348


the mean is the most descriptive as it includes uses the most accurate data

# 10. Splitting the Data to Predicting Survival

For all the passengers in the dataset, the mean survival rate is around .38 (38% of the passengers survived). From our data, we may be able to profile who survived and who didn't!

Split the data by pclass. Does the class a passenger was in affect survivability? You can do this by:
* Creating a list of `True` and `False` values conditional on a column's value
* Taking the mean of the `Survived` column where those values are `True`

In [205]:
Lived = titanic_df[titanic_df['Survived'] == 1]
Died = titanic_df[titanic_df['Survived'] == 0]
def group_by_Pclass(data_dict, split_var, val):
    Pclass = data_dict[split_var] == val
    data_dict = data_dict[Pclass]
    return data_dict
    
    
P1 = group_by_Pclass(data_dict_columns, 'Pclass', 1)
    
    

            

KeyError: False

# 11. Independent Work

Use the techniques from step 10 to make different conditional splits in the `Survived` column. Can you find a combination of splits that maximizes the survival rate?

In [None]:
# Your code here

# 12. Distributions

### A) For each of your numeric features, use `pyplot` subplots to plot a hisogram for each feature.

* Make sure to title each subplot.
* If you get an error, it may be caused by `np.NaN`

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Your code here

### B) for each of these values, what's the 90% confidence interval of the population mean?

* Create a function to find the confidence interval, and use it on each of the numeric columns.
* What's your interpretation of the interval?

In [None]:
from scipy import stats
# Your code here

# 13. Pandas

### A: Load the titanic csv into a `DataFrame` using `pd.read_csv()`

In [135]:
import pandas as pd

In [136]:
titanic_df = pd.read_csv('titanic.csv')

### B: Display the first 5 rows, the last 4 rows, and a sample of 3 rows.

In [137]:
# Your code here
titanic_df.head()




Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [138]:
titanic_df.tail()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [139]:
print(titanic_df[6:9])

   PassengerId  Survived  Pclass  \
6            7         0       1   
7            8         0       3   
8            9         1       3   

                                                Name     Sex   Age  SibSp  \
6                            McCarthy, Mr. Timothy J    male  54.0      0   
7                     Palsson, Master. Gosta Leonard    male   2.0      3   
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0   

   Parch  Ticket     Fare Cabin Embarked  
6      0   17463  51.8625   E46        S  
7      1  349909  21.0750   NaN        S  
8      2  347742  11.1333   NaN        S  


### C: Create a row mask that is `True` when `Pclass == 3`. Use this to mask your `DataFrame`. Find the mean of the `Survived` column. Is it the same as what we calculated in part 10?

In [140]:
pclass3_mask = titanic_df['Pclass'] == 3

mean_t = titanic_df[pclass3_mask].mean()

print(mean_t)

PassengerId    439.154786
Survived         0.242363
Pclass           3.000000
Age             25.140620
SibSp            0.615071
Parch            0.393075
Fare            13.675550
dtype: float64


### D: Using a `.groupby()`, what is the mean of the survival column grouped by `Pclass` and `Sex`. What are your observations?

In [141]:
# Your code here
new_df = titanic_df.groupby(by =['Pclass', "Sex"], axis = 0)
new_df.mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,PassengerId,Survived,Age,SibSp,Parch,Fare
Pclass,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,female,469.212766,0.968085,34.611765,0.553191,0.457447,106.125798
1,male,455.729508,0.368852,41.281386,0.311475,0.278689,67.226127
2,female,443.105263,0.921053,28.722973,0.486842,0.605263,21.970121
2,male,447.962963,0.157407,30.740707,0.342593,0.222222,19.741782
3,female,399.729167,0.5,21.75,0.895833,0.798611,16.11881
3,male,455.51585,0.135447,26.507589,0.498559,0.224784,12.661633


### E: Survival Rate by Age Range:  `pd.cut()` takes two arguments: A `list`, `Series`, or `array`, and a list of bins. Create a new column in your `DataFrame` using `pd.cut()` that groups your ages into bins of 5 years. Then, use `.groupby()` to display the survival rate and count for each age group

In [164]:
# Your code here

bins = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50 ,55, 60, 65, 70, 100]

cut_df = pd.cut(titanic_df['Age'], bins)
titanic_df = titanic_df.assign(AgeBin=cut_df.values)

titanic_df

new_df_2 = titanic_df.groupby(by = ['AgeBin'], axis = 0, group_keys = True)
new_df_2['Survived'].mean()



AgeBin
(0, 5]       0.704545
(5, 10]      0.350000
(10, 15]     0.578947
(15, 20]     0.343750
(20, 25]     0.344262
(25, 30]     0.388889
(30, 35]     0.465909
(35, 40]     0.417910
(40, 45]     0.361702
(45, 50]     0.410256
(50, 55]     0.416667
(55, 60]     0.388889
(60, 65]     0.285714
(65, 70]     0.000000
(70, 100]    0.200000
Name: Survived, dtype: float64

# 14. Hypothesis Testing

### A) Hypothesis:

Create a null and alternate hypothesis to ask the following quesiton: Was the `Age` of survivors different from that of people who didn't survive?

**Hypotheses:**

$H_0$: The survival rate was no different for those under 20

$H_1$: The age of a passenger directly affected wether or not they survived

### B) T-Testing

Use a t-test to test your null hypothesis. What's the p-value? What's your interpretation? Do you accept or reject your null hypothesis? What does this mean in terms of `Age`?

In [183]:
import numpy as np
import matplotlib
from scipy import stats
ttest = stats.ttest_ind
titanic_df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeBin
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,"(20, 25]"
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,"(35, 40]"
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,"(25, 30]"
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,"(30, 35]"
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,"(30, 35]"


In [199]:
Lived = titanic_df[titanic_df['Survived'] == True]
Died = titanic_df[titanic_df['Survived'] == False]
ttest(Lived['Age'], Died['Age']) 

Ttest_indResult(statistic=-2.0865081090373168, pvalue=0.037217083726850342)

In [193]:
# Your code here
Survived = titanic_df['Survived'] == True


contingency_table = pd.crosstab(titanic_df.AgeBin, titanic_df.Survived)
contingency_table


 

Survived,0,1
AgeBin,Unnamed: 1_level_1,Unnamed: 2_level_1
"(0, 5]",13,31
"(5, 10]",13,7
"(10, 15]",8,11
"(15, 20]",63,33
"(20, 25]",80,42
"(25, 30]",66,42
"(30, 35]",47,41
"(35, 40]",39,28
"(40, 45]",30,17
"(45, 50]",23,16


Given a pvalue of .005 we can reject the null hypothesis, age had a direct effect on survival aboard the titanic 

# 14. Evaluation

Please use markdown cells to submit your responses. 

1. What was easy for you in this project?
Moderatley Difficult
2. What was difficult?
Changing the data types/correcting for Nan values and the hypothesis testing
3. Where did you make the most improvement?
reading in data and organizing it into the desired format
4. Where would you like to improve?
hypothesis testing