# Exercise 2

## Description

In this exercise, you will perform some rudimentary practices similar to those of
an actual data scientist.

But first, we will look at several of the main Python packages we are going to use.

## Packages and built-in functions

Python has a large number of packages that make complicated tasks much easier. We won't discuss how to install packages or list the packages that exist, but we will briefly describe how they are used.

An easy way to see why packages are useful is to remember: "**Python packages give us access to MANY functions**".

Packages contain pre-defined functions that make our life easier! We've seen pre-defined functions before: for example, the built-in function `str()`, which we used to convert numbers into strings in the Python Basics notebook.

In this class we will use two packages: `pandas` and `numpy`:

- **`pandas`** is a data manipulation package. It lets us store data in data frames. More on this soon.
- **`numpy`** (pronounced num-pie) is used for doing "math stuff", such as complex mathematical operations (e.g., square roots, exponents, logs), operations on matrices, and more. 

As we use these through the semester, their usefulness will become increasingly apparent.

#### Credits
This notebook is based on content taken from Udacity and Prof. Foster Provost's course.

To make the contents of a package available, you need to import it:

In [1]:
import pandas
import numpy

It is often easier to use short names for packages. The aliases below have become the norm, so we will use them here so that you recognize them when you encounter them in your work.

In [2]:
import pandas as pd
import numpy as np

We can now use package-specific functions. For example, numpy has a function called `sqrt()` that returns the square root of a number. Since it is part of numpy, we tell Python where to find it by prefixing it with the package name and a dot (e.g., `np.sqrt()`).

In the following cell you can also see how to write **comments** in your code. Take my advice: write comments as you go.  It's helpful when you want to collaborate, then you don't have to figure out what you did to explain it to your collaborator.  But even more: often you need to come back to an analysis weeks, months, or even years later, and you will thank yourself for explaining what you did!

For storing numbers, strings and other objects, we can use lists, dictionaries and sets.
A list represents a sequence of elements in a certain order, a dictionary maps keys to values, and a set holds unique elements in no particular order.
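
As a minimal sketch of how these three containers differ (the variable names and values below are made up for illustration):

```python
# Hypothetical example data, not part of the exercise.
grades = [88, 92, 88, 75]                    # list: ordered, allows duplicates
phone_book = {'student1': '(929)-000-0000',
              'student2': '(917)-000-0000'}  # dictionary: key -> value
unique_grades = set(grades)                  # set: no duplicates, no fixed order

print(grades[0])               # first element of the list (indexing starts at 0)
print(phone_book['student2'])  # look up a value by its key
print(len(unique_grades))      # the duplicate 88 is stored only once
```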

In [3]:

some_list = [0,0,1,2,3,3,4.5,7.6]
some_dictionary = {'student1': '(929)-000-0000', 'student2': '(917)-000-0000', 'student3': '(470)-000-0000'}


# In this part of the code I am using numpy (np) functions

print ("Square root: " + str ( np.sqrt(25) ))
print ("Maximum element of our previous list: " + str( np.max(some_list) ))

# In this part of the code I am using python functions

print ("Number of elements in our previous list: " + str( len(some_list) ))
print ("Sum of elements in our previous list: " + str( sum(some_list) ))
print ("Range of 5 numbers (remember we start with 0): " + str( list(range(5)) ))


Square root: 5.0
Maximum element of our previous list: 7.6
Number of elements in our previous list: 8
Sum of elements in our previous list: 21.1
Range of 5 numbers (remember we start with 0): [0, 1, 2, 3, 4]


What about the package **Pandas**? 

Pandas gives us the **DATAFRAME** -- one of the main data structures used in data analytics.

A DataFrame is a 2-dimensional "labeled" data structure with columns of potentially different types. It is generally the most commonly used pandas object. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. [More details here](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)
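
To make the optional `index` and `columns` arguments concrete, here is a small sketch. The records and the row labels `row_a` and `row_b` are invented for illustration only:

```python
import pandas as pd

# Hypothetical records, used only to illustrate the optional arguments.
records = [['studentX', 25], ['studentY', 41]]

small_df = pd.DataFrame(
    records,
    index=['row_a', 'row_b'],   # row labels
    columns=['Name', 'Age'],    # column labels
)

print(small_df)
print(small_df.dtypes)                # each column keeps its own type
print(small_df.loc['row_b', 'Age'])   # select by row label and column label
```

If you leave out `index`, pandas numbers the rows 0, 1, 2, ... as in the examples that follow.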

Let's build a dataframe called `data_pandas`, starting by creating some data records (the lists), and then putting them into a dataframe, with the columns labeled:

In [4]:
list1 = ['studentA',22,'(929)-000-000']
list2 = ['studentB',np.nan,'(646)-000-000']
list3 = ['studentC',30,'(917)-000-000']
list4 = ['studentD',31,'(646)-001-001']
list5 = ['studentE',np.nan,'(929)-001-001']
list6 = ['studentF',30,'(917)-001-001']
list7 = ['studentG',30,'(470)-001-001']

data_pandas = pd.DataFrame([list1,list2,list3,list4,list5,list6,list7],columns=['Name','Age','Mobile'])
data_pandas

,Name,Age,Mobile
0,studentA,22.0,(929)-000-000
1,studentB,,(646)-000-000
2,studentC,30.0,(917)-000-000
3,studentD,31.0,(646)-001-001
4,studentE,,(929)-001-001
5,studentF,30.0,(917)-001-001
6,studentG,30.0,(470)-001-001


Now let's count the number of rows for each distinct age:

In [5]:
data_pandas.groupby('Age').count()

Age,Name,Mobile
22.0,1,1
30.0,3,3
31.0,1,1
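
Note that the counts above add up to 5, not 7: the two students whose age is missing (`NaN`) do not appear in any group. The sketch below rebuilds the same `Age` values as a standalone Series (so it runs on its own) and uses `value_counts`, which is a different call from the `groupby` above, to show the missing values explicitly:

```python
import numpy as np
import pandas as pd

# Same ages as in data_pandas, rebuilt here so the snippet runs on its own.
ages = pd.Series([22, np.nan, 30, 31, np.nan, 30, 30], name='Age')

counts_without_missing = ages.value_counts()            # NaN rows are skipped
counts_with_missing = ages.value_counts(dropna=False)   # NaN gets its own row

print(counts_without_missing.sum())   # rows with a known age
print(counts_with_missing.sum())      # all rows
print(ages.isna().sum())              # number of missing ages
```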


We can select a subset of the columns by passing a list with the corresponding column names:

In [6]:
data_pandas[ ['Name','Age'] ]

,Name,Age
0,studentA,22.0
1,studentB,
2,studentC,30.0
3,studentD,31.0
4,studentE,
5,studentF,30.0
6,studentG,30.0


We can also add columns (each new column must have the same number of rows as the dataframe it is added to!):

In [7]:

data_pandas['business_major'] = ['yes','no','yes','yes','yes','no','yes']
data_pandas['years_experience'] = [1,4,2,6,0,3,0]

data_pandas


,Name,Age,Mobile,business_major,years_experience
0,studentA,22.0,(929)-000-000,yes,1
1,studentB,,(646)-000-000,no,4
2,studentC,30.0,(917)-000-000,yes,2
3,studentD,31.0,(646)-001-001,yes,6
4,studentE,,(929)-001-001,yes,0
5,studentF,30.0,(917)-001-001,no,3
6,studentG,30.0,(470)-001-001,yes,0
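
The requirement that a new column has as many values as the dataframe has rows can be checked with a short sketch. The three-row frame `toy` is made up for illustration:

```python
import pandas as pd

# Hypothetical three-row frame, rebuilt so the snippet runs on its own.
toy = pd.DataFrame({'Name': ['studentA', 'studentB', 'studentC']})

toy['years_experience'] = [1, 4, 2]        # three values for three rows: fine

try:
    toy['business_major'] = ['yes', 'no']  # only two values: rejected
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
print(list(toy.columns))   # the failed column was not added
```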


Let's look at the groups again, but this time use `sum` to add up the numeric values in each group instead of just counting rows (sum / aggregate):

In [8]:
data_pandas.groupby('Age').sum(numeric_only=True)

Age,years_experience
22.0,1
30.0,5
31.0,6


****

What is the average age? (Combine packages: numpy and pandas)


In [9]:
np.mean( data_pandas['Age'] )

28.600000000000001
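
The average above is taken over the five known ages only; the two missing values are skipped. The sketch below rebuilds the `Age` values as a standalone Series and compares how pandas and plain NumPy arrays treat `NaN`:

```python
import numpy as np
import pandas as pd

# Same ages as in data_pandas, rebuilt here so the snippet runs on its own.
ages = pd.Series([22, np.nan, 30, 31, np.nan, 30, 30])

pandas_mean = ages.mean()                      # pandas skips NaN by default
plain_mean = np.mean(ages.to_numpy())          # a plain NumPy array does not: result is nan
nan_aware_mean = np.nanmean(ages.to_numpy())   # NumPy's NaN-ignoring variant

print(pandas_mean, plain_mean, nan_aware_mean)
```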


What about operations on entire columns? This can make data munging much easier!

Let's take the difference between age and years of experience:

_(Look how I select columns here!)_


In [10]:
data_pandas.Age - data_pandas.years_experience


0    21.0
1     NaN
2    28.0
3    25.0
4     NaN
5    27.0
6    30.0
dtype: float64
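
The dot notation used above is a shortcut for the bracket notation seen earlier. A small sketch of when each works; the frame `toy` and the column name `home town` are invented for illustration:

```python
import pandas as pd

# Hypothetical frame; 'home town' is an invented column name with a space in it.
toy = pd.DataFrame({'Age': [22, 30], 'home town': ['Paris', 'Lyon']})

same_column = toy.Age.equals(toy['Age'])   # dot and bracket return the same data
towns = toy['home town']                   # names with spaces need brackets

toy['age_next_year'] = toy['Age'] + 1      # new columns are created with brackets

print(same_column)
print(toy)
```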

### Reading files

There are many ways to read data in Python. In this exercise we will use Pandas to read data into dataframes.



In [11]:
df = pd.read_csv('https://raw.githubusercontent.com/AUP-CS2091/Week-2/master/titanic-data.csv')
df

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
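
If you want to experiment with `read_csv` without downloading anything, the following sketch reads a tiny CSV from a string. The text in `csv_text` is a hand-made excerpt (four columns, three of the passengers shown above), not the real file:

```python
import io
import pandas as pd

# A tiny hand-made CSV; values copied from three of the rows shown above.
csv_text = """PassengerId,Survived,Sex,Age
1,0,male,22.0
2,1,female,38.0
6,0,male,
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first two rows only
print(sample.shape)     # (rows, columns)
print(sample.dtypes)    # column types inferred by pandas
```

The empty `Age` field in the last row is read as `NaN`, just like the missing ages in the full dataset.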


We can also get basic statistics on the data frame.

In [12]:
df.describe()

,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


You can ignore **std** (the standard deviation) for now. The 25%, 50% and 75% rows are percentiles: the values below which that share of the data falls.
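
To see where the percentile rows come from, here is a minimal sketch on five made-up numbers. It compares `describe()` with `quantile()`:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])     # made-up numbers

summary = values.describe()
median = values.quantile(0.50)          # the 50% row is the median
lower_quartile = values.quantile(0.25)  # the 25% row (first quartile)

print(summary)
print(median, lower_quartile)
```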
     

#### Extra added bonus!!   ---  Auto-complete
One of the most useful features of the IPython notebook is its built-in help: tab completion and pop-up documentation.

Try this: put your cursor just after `sqrt(` in the cell below and press `Shift + Tab` 4 times, slowly

In [13]:
np.sqrt(4)

2.0

I find this amazingly useful. I think of this as "the more confused I am, the more times I should press Shift+Tab". Nothing bad will happen if you press Shift+Tab 12 times.

Okay, let's try tab completion for function names! Just hit Tab when typing below to get suggestions.

In [14]:
np.sqrt(45164161635)

212518.61479644553

This is super useful when (like me) you forget the names of everything!

## Hands-on

We now arrive at some practical exercises. We will first do some basic Python coding and then
write Python code to predict whether passengers survived the Titanic disaster.

**1\. Create one list of 5 fruits and another one with 5 colors**

In [15]:
fruits = ['orange', 'apple', 'banana', 'strawberry', 'peach']
colors = ['orange', 'red', 'yellow', 'red', 'gold']
fruits, colors

(['orange', 'apple', 'banana', 'strawberry', 'peach'],
 ['orange', 'red', 'yellow', 'red', 'gold'])

**2\. Go through each fruit (first list) and print out the name of the fruit with one color from the second list**

(don't worry, it doesn't have to be the color of the fruit!)

Example of what you should print:  _apple is purple_

In [16]:
for fruit, color in zip(fruits, colors):
    print ("The " + fruit + " is " + color)

The orange is orange
The apple is red
The banana is yellow
The strawberry is red
The peach is gold
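
One thing to keep in mind about `zip`: it stops at the end of the shorter list. A small sketch with made-up lists of different lengths:

```python
fruits_demo = ['orange', 'apple', 'banana']   # three items
colors_demo = ['orange', 'red']               # two items

pairs = list(zip(fruits_demo, colors_demo))   # stops at the shorter list
print(pairs)

for fruit, color in pairs:
    print(f"The {fruit} is {color}")          # 'banana' is never printed
```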


**3\. Add two new fruits to your list with a _BUILT-IN_ function**

( Look for the function with the **TAB** hint! )

In [17]:
fruits.extend(("cherry", "tomato"))
fruits

['orange', 'apple', 'banana', 'strawberry', 'peach', 'cherry', 'tomato']
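
Lists offer both `append` and `extend`, and they are easy to mix up. A short sketch of the difference, using throwaway lists:

```python
basket = ['orange', 'apple']

basket.append('cherry')              # adds a single element
basket.extend(['tomato', 'peach'])   # adds every element of another sequence

nested = ['orange', 'apple']
nested.append(['cherry', 'tomato'])  # append with a list adds ONE element: the list itself

print(basket)
print(nested)
```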

**4\. Use the list of fruits and sort the names (put them in alphabetical order)**

( Hint: Numpy has a great function for that!)

In [18]:
import numpy as np
f = np.sort(fruits)
print (f)

['apple' 'banana' 'cherry' 'orange' 'peach' 'strawberry' 'tomato']
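
The output above has no commas because `np.sort` returns a NumPy array rather than a list. As a sketch, here are three common ways to sort, shown on a made-up list:

```python
import numpy as np

names = ['peach', 'apple', 'cherry']

as_array = np.sort(names)    # new NumPy array; the original list is unchanged
as_list = sorted(names)      # new Python list; the original list is unchanged
names_before = list(names)   # copy taken before the in-place sort below

names.sort()                 # sorts the list in place and returns None

print(as_array)
print(as_list)
print(names_before, names)
```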


## Modeling chances for surviving the sinking of the Titanic

Part of a data scientist's job is to use her or his intuition and insight to
write algorithms and heuristics. A data scientist also creates mathematical models 
to make predictions based on some attributes from the data that they are examining.

You need to use your knowledge and intuition about the Titanic
and its passengers' attributes to predict whether the passengers survived
or perished. You can read more about the Titanic and specifics about this dataset at:

- http://en.wikipedia.org/wiki/RMS_Titanic
- http://www.kaggle.com/c/titanic-gettingStarted

In this exercise, you are given a list of Titanic passengers
and their associated information. More information about the data can be found at
http://www.kaggle.com/c/titanic-gettingStarted/data.

For this exercise, you need to write a heuristic that will use
the passengers' information to predict if that person survived the Titanic disaster.

Your prediction should be at least 79% accurate.

For the maximum score, you should reach an accuracy above 80%.

Here's a simple heuristic to start off:
   1) If the passenger is female, your heuristic should assume that the
   passenger survived.
   2) If the passenger is male, your heuristic should
   assume that the passenger did not survive.

You can access the gender of a passenger via passenger['Sex'].
If the passenger is male, passenger['Sex'] will return a string "male".
If the passenger is female, passenger['Sex'] will return a string "female".

Write your prediction back into the "predictions" dictionary. The
key of the dictionary should be the passenger's id (which can be accessed
via passenger["PassengerId"]) and the associated value should be 1 if the
passenger survived or 0 otherwise.

For example, if a passenger is predicted to have survived:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 1

And if a passenger is predicted to have perished in the disaster:
passenger_id = passenger['PassengerId']
predictions[passenger_id] = 0
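
Putting the pieces above together, the simple starting heuristic could look roughly like the sketch below. The function name `predict_by_sex` and the three passengers in `toy_passengers` are made up for illustration; your own solution goes into `predict` further down and should be tested on the real data.

```python
import pandas as pd

# Three made-up passengers; only the columns the heuristic needs.
toy_passengers = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Sex': ['male', 'female', 'male'],
})

def predict_by_sex(df):
    """Baseline: predict 1 (survived) for females, 0 otherwise."""
    predictions = {}
    for _, passenger in df.iterrows():
        passenger_id = passenger['PassengerId']
        predictions[passenger_id] = 1 if passenger['Sex'] == 'female' else 0
    return predictions

baseline = predict_by_sex(toy_passengers)
print(baseline)
```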


## Data

The available attributes are:

    Pclass          Passenger Class
                    (1 = 1st; 2 = 2nd; 3 = 3rd)
    Name            Name
    Sex             Sex
    Age             Age
    SibSp           Number of Siblings/Spouses Aboard
    Parch           Number of Parents/Children Aboard
    Ticket          Ticket Number
    Fare            Passenger Fare
    Cabin           Cabin
    Embarked        Port of Embarkation
                    (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socioeconomic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in years; fractional if age less than one
If the age is estimated, it is in the form xx.5

With respect to the family relation variables (i.e. SibSp and Parch)
some relations were ignored. The following are the definitions used
for SibSp and Parch.

    Sibling:  brother, sister, stepbrother, or stepsister of passenger aboard Titanic
    Spouse:   husband or wife of passenger aboard Titanic (mistresses and fiancees ignored)
    Parent:   mother or father of passenger aboard Titanic
    Child:    son, daughter, stepson, or stepdaughter of passenger aboard Titanic


In [23]:
import pandas as pd
import sklearn.metrics as mts

def predict(df):
    predictions = {}
    for passenger_index, passenger in df.iterrows():
        passenger_id = passenger['PassengerId']
        #your code here
        if passenger['SibSp'] < 3 and passenger['Parch'] < 3:
            if passenger['Sex'] == 'female':
                predictions[passenger_id] = 1
            else:
                predictions[passenger_id] = 0
        else:
            if passenger['Pclass'] < 3:
                predictions[passenger_id] = 1
            else:
                predictions[passenger_id] = 0
        #end of code
    return predictions

predictions = predict(df)
predictions

{1: 0,
 2: 1,
 3: 1,
 4: 1,
 5: 0,
 6: 0,
 7: 0,
 8: 0,
 9: 1,
 10: 1,
 11: 1,
 12: 1,
 13: 0,
 14: 0,
 15: 1,
 16: 1,
 17: 0,
 18: 0,
 19: 1,
 20: 1,
 21: 0,
 22: 0,
 23: 1,
 24: 0,
 25: 0,
 26: 0,
 27: 0,
 28: 1,
 29: 1,
 30: 0,
 31: 0,
 32: 1,
 33: 1,
 34: 0,
 35: 0,
 36: 0,
 37: 0,
 38: 0,
 39: 1,
 40: 1,
 41: 1,
 42: 1,
 43: 0,
 44: 1,
 45: 1,
 46: 0,
 47: 0,
 48: 1,
 49: 0,
 50: 1,
 51: 0,
 52: 0,
 53: 1,
 54: 1,
 55: 0,
 56: 0,
 57: 1,
 58: 0,
 59: 1,
 60: 0,
 61: 0,
 62: 1,
 63: 0,
 64: 0,
 65: 0,
 66: 0,
 67: 1,
 68: 0,
 69: 0,
 70: 0,
 71: 0,
 72: 0,
 73: 0,
 74: 0,
 75: 0,
 76: 0,
 77: 0,
 78: 0,
 79: 0,
 80: 1,
 81: 0,
 82: 0,
 83: 1,
 84: 0,
 85: 1,
 86: 0,
 87: 0,
 88: 0,
 89: 1,
 90: 0,
 91: 0,
 92: 0,
 93: 0,
 94: 0,
 95: 0,
 96: 0,
 97: 0,
 98: 0,
 99: 1,
 100: 0,
 101: 1,
 102: 0,
 103: 0,
 104: 0,
 105: 0,
 106: 0,
 107: 1,
 108: 0,
 109: 0,
 110: 1,
 111: 0,
 112: 1,
 113: 0,
 114: 1,
 115: 1,
 116: 0,
 117: 0,
 118: 0,
 119: 0,
 120: 0,
 121: 0,
 122: 0,
 123: 0,
 

Once you have finished implementing the function `predict`, you can use the next code cell in order to check your accuracy.

In [22]:
real_rate = df[['PassengerId','Survived']]
real_rate = real_rate.set_index('PassengerId').T.to_dict(orient='list')
real_rate = {x: real_rate[x][0] for x in real_rate} # dict comprehension: unwrap each one-element list
mts.accuracy_score(list(real_rate.values()), list(predictions.values()))

0.80359147025813693
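
The check above relies on both dictionaries listing the passengers in the same order. As an alternative sketch (not the notebook's own method), you can match predictions to the true labels by `PassengerId` and compute the accuracy by hand. The frame `truth` and the dictionary `toy_predictions` are made-up stand-ins for `df` and `predictions`:

```python
import pandas as pd

# Made-up ground truth and predictions for four passengers.
truth = pd.DataFrame({'PassengerId': [1, 2, 3, 4],
                      'Survived':    [0, 1, 1, 0]})
toy_predictions = {4: 0, 2: 1, 1: 0, 3: 0}   # deliberately in a different order

# Look up each prediction by the passenger's id, so order cannot matter.
predicted = truth['PassengerId'].map(toy_predictions)

# Accuracy = share of passengers whose prediction equals the true label.
accuracy = (predicted == truth['Survived']).mean()
print(accuracy)
```

Here three of the four predictions are correct, so the accuracy is 0.75.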