## Learning Objectives

- Learn common DataFrame manipulation methods in PySpark and pandas
- Work with the Titanic dataset
- Apply visualization to a DataFrame

### Let's create a Spark DataFrame from the Titanic CSV file

In [1]:
import numpy as np 
from pyspark import SparkContext

sc = SparkContext()

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark regression example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [3]:
df = spark.read.csv('titanic.csv', header=True, inferSchema=True)

### Let's look at the first 5 rows of the DataFrame

In [4]:
df.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+

In [5]:
print("Shape:", (df.count(), len(df.columns)))

Shape: (891, 12)


### Titanic Dataset Description

VARIABLE DESCRIPTIONS:  
survival        Survival  
                (0 = No; 1 = Yes)  
pclass          Passenger Class  
                (1 = 1st; 2 = 2nd; 3 = 3rd)  
name            Name  
sex             Sex  
age             Age  
sibsp           Number of Siblings/Spouses Aboard  
parch           Number of Parents/Children Aboard  
ticket          Ticket Number  
fare            Passenger Fare  
cabin           Cabin  
embarked        Port of Embarkation  
                (C = Cherbourg; Q = Queenstown; S = Southampton)  

### How many of the passengers were children, youth, middle-aged, and old?

In [7]:
import matplotlib.pyplot as plt
from seaborn import distplot


df.groupby("Age").count().show()

+----+-----+
| Age|count|
+----+-----+
| 8.0|    4|
|70.0|    2|
| 7.0|    3|
|20.5|    1|
|49.0|    6|
|29.0|   20|
|40.5|    2|
|64.0|    2|
|47.0|    9|
|42.0|   13|
|24.5|    1|
|44.0|    9|
|35.0|   18|
|null|  177|
|62.0|    4|
|18.0|   26|
|80.0|    1|
|34.5|    1|
|39.0|   14|
| 1.0|    7|
+----+-----+
only showing top 20 rows
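
The per-age counts above do not yet answer the question about age groups. A minimal pandas sketch of the grouping step is shown here, assuming the Age column has been pulled into a pandas Series. The toy ages, the bin edges, and the group labels are all assumptions made for illustration, not values from the notebook.

```python
import pandas as pd

# Toy ages standing in for the Titanic Age column (None marks a missing age).
ages = pd.Series([2, 8, 15, 22, 24, 30, 35, 47, 62, 80, None], name="Age")

# Hypothetical bin edges and labels; adjust them to the grouping you prefer.
bins = [0, 13, 25, 60, 120]
labels = ["child", "youth", "middle age", "old"]

# right=False makes each bin include its left edge, e.g. [13, 25) is "youth".
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Count passengers per group; missing ages fall into no bin and are left out.
group_counts = age_group.value_counts(sort=False)
print(group_counts)
```

With the real data, the Series could come from `df.select("Age").toPandas()["Age"]`, and `group_counts.plot(kind="bar")` would draw the counts.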



### How many values are missing (null or NaN) in Age and in the other columns?

In [8]:
from pyspark.sql.functions import isnan, when, count, col

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+
|          0|       0|     0|   0|  0|177|    0|    0|     0|   0|  687|       2|
+-----------+--------+------+----+---+---+-----+-----+------+----+-----+--------+



### Create a new Gender column: 0 when Sex is female, 1 when Sex is male

In [11]:
df = df.withColumn("Gender", when(df['Sex'] == "male", 1).otherwise(0))
df.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Gender|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|     1|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|     0|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|     0|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|     0|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|     1|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+------+

### Check that the DataFrame now has one more column

In [13]:
(df.count(), len(df.columns))

(891, 13)

### Show the most common Age values

In [22]:
df.groupby("Age").count().sort('count', ascending=False).show(5)

+----+-----+
| Age|count|
+----+-----+
|null|  177|
|24.0|   30|
|22.0|   27|
|18.0|   26|
|30.0|   25|
+----+-----+
only showing top 5 rows



### Show the rows whose Age is not null

In [23]:
df.where(col("Age").isNotNull()).show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|Gender|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| null|       S|     1|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|     0|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| null|       S|     0|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|     0|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| null|       S|     1|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+------+

### Filter the DataFrame to the passengers who embarked at 'C'

In [27]:
c_embark_df = df.filter(df['Embarked'] == "C")
c_embark_df.show(5)

+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----+--------+------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|  Ticket|   Fare|Cabin|Embarked|Gender|
+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----+--------+------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|PC 17599|71.2833|  C85|       C|     0|
|         10|       1|     2|Nasser, Mrs. Nich...|female|14.0|    1|    0|  237736|30.0708| null|       C|     0|
|         20|       1|     3|Masselmani, Mrs. ...|female|null|    0|    0|    2649|  7.225| null|       C|     0|
|         27|       0|     3|Emir, Mr. Farred ...|  male|null|    0|    0|    2631|  7.225| null|       C|     1|
|         31|       0|     1|Uruchurtu, Don. M...|  male|40.0|    0|    0|PC 17601|27.7208| null|       C|     1|
+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----+--------+------+

### Show the most common Age values for the passengers who embarked at 'C'

In [28]:
c_embark_df.groupby("Age").count().sort('count', ascending=False).show(5)

+----+-----+
| Age|count|
+----+-----+
|null|   38|
|30.0|    7|
|24.0|    7|
|22.0|    5|
|17.0|    5|
+----+-----+
only showing top 5 rows



### Describe a specific column 

In [30]:
df.describe(['Embarked']).show()

+-------+--------+
|summary|Embarked|
+-------+--------+
|  count|     889|
|   mean|    null|
| stddev|    null|
|    min|       C|
|    max|       S|
+-------+--------+



### How many unique values does 'Embarked' have?

In [31]:
df.select('Embarked').distinct().dropna().count()

3

### Count the passengers for each 'Embarked' value

In [37]:
df.groupby('Embarked').count().show(4)

+--------+-----+
|Embarked|count|
+--------+-----+
|       Q|   77|
|    null|    2|
|       C|  168|
|       S|  644|
+--------+-----+



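### Count the different 'Embarked' values and plot them horizontally

The cells from here on use the pandas API, so the Spark DataFrame is first converted with `toPandas()`.

In [None]:
df = df.toPandas()  # Spark DataFrame -> pandas DataFrame (run this cell only once)
df['Embarked'].value_counts().plot(kind='barh').invert_yaxis()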


### Another way to do the count and plot it

In [None]:
import seaborn as sns


# Bar Chart Example #1 (Simple): Categorical Variables Showing Counts
sns.countplot(x="Embarked", palette="spring", data=df)


In [None]:
df['Embarked'].value_counts()

In [None]:
df['Sex'].value_counts().to_json()

In [None]:
df['Sex'].value_counts().plot(kind='bar')

In [None]:
df['Sex'].value_counts().plot(kind='pie')

### Plot how many passengers were children, youth, middle-aged, and old, by Sex, for those who embarked at 'C'

In [None]:
for i in df[df['Embarked'] == 'C'].groupby('Sex')['Age']:
    print(i)
    

In [None]:
df[df['Embarked'] == 'C'].groupby('Sex')['Age'].hist(bins=16, alpha=0.5)

In [None]:
df[df['Embarked'] == 'C'].groupby('Sex')['Age'].plot(bins=16, kind='hist', legend=True, alpha=0.5)

In [None]:
df[df['Embarked'] == 'C'].groupby('Sex')['Age'].value_counts()


In [None]:
# # import the pandas library
# import pandas as pd
# import numpy as np

# ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
#          'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
#          'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
#          'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
#          'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
# df = pd.DataFrame(ipl_data)

# grouped = df.groupby('Year')
# df.groupby('Year')['Points'].agg(np.mean)

# https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm

### What is the average Age of females and males who embarked at 'C'?

In [None]:
# Passing the aggregation by name avoids the warning newer pandas versions emit for np.mean
df[df['Embarked'] == 'C'].groupby('Sex')['Age'].agg('mean')

### Another way to compute the same averages

In [None]:
df[df['Embarked'] == 'C'].groupby('Sex')['Age'].apply(lambda x:np.mean(x))

### What is the oldest Age among females and males who embarked at 'C'?

In [None]:
# Passing the aggregation by name avoids the warning newer pandas versions emit for np.max
df[df['Embarked'] == 'C'].groupby('Sex')['Age'].agg('max')

### Plot the Fare paid against Age

In [None]:
sns.regplot(x="Age", y="Fare", fit_reg=False, data=df)

In [None]:
df.plot.scatter(x="Age", y="Fare")

### Plot the survival rate of each Sex, split by passenger class

In [None]:
sns.barplot(x="Sex", y="Survived", hue="Pclass", data=df)

### Plot how many males and females were in each passenger class

In [None]:
sns.countplot(x="Sex", hue="Pclass", data=df)

In [None]:
import seaborn as sns
sns.countplot(x="Sex", hue="Survived", data=df)

In [None]:
import pandas as pd  # needed for pd.crosstab in this and later cells
pd.crosstab(df['Sex'], df['Survived']).to_json()

### Verify the values obtained for the percentages

In [None]:
df[(df['Sex'] == 'female') & (df['Pclass'] == 1)]['Survived'].value_counts()

In [None]:
91/(91 + 3)

In [None]:
dict(df[(df['Sex'] == 'female') & (df['Pclass'] == 1)]['Survived'].value_counts())
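
The manual check above can also be done for every group at once. By default `sns.barplot` draws the mean of `y` for each category, and because Survived is coded 0/1, that mean is the survival rate. The sketch below uses a small made-up table (not the real Titanic rows) to show the idea; the variable names `toy` and `survival_rate` are illustrative.

```python
import pandas as pd

# Toy rows (not the real Titanic data) with the three columns the bar plot uses.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Pclass": [1, 1, 3, 1, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Survived is 0/1, so its mean within a group is that group's survival rate.
survival_rate = toy.groupby(["Sex", "Pclass"])["Survived"].mean()
print(survival_rate)
```

Running the same `groupby` on the real `df` should give the numbers to compare against the bar heights.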

### Stacked bar plot of counts by Sex for each passenger class

In [None]:
df.groupby(['Sex'])['Pclass'].value_counts().unstack().plot(kind='bar',stacked=True)

### Stacked bar plot of counts by Sex and Survived for each passenger class

In [None]:
df.groupby(['Sex', 'Survived'])['Pclass'].value_counts().unstack().plot(kind='bar',stacked=True)

### Values can be hard to read from a plot: how many females and males were in each passenger class?

In [None]:
# df.groupby(['Sex'])['Pclass'].value_counts().unstack()
# the above and crosstab are the same 
pd.crosstab(df['Sex'], df['Pclass'])

In [None]:
pd.crosstab(df['Sex'], df['Survived'])

In [None]:
pd.crosstab(df['Sex'], df['Embarked'])

### Show the cross tab above as percentages and present it graphically

In [None]:
sns.heatmap(pd.crosstab(df['Sex'], df['Embarked'], normalize='index'), cmap="YlGnBu", annot=True)

## Question:

What percent of passengers embarked at C?

In [None]:
# Answer:

print(dict(df['Embarked'].value_counts()))

dict(df['Embarked'].value_counts())['C']

In [None]:
sum(dict(df['Embarked'].value_counts()).values())

In [None]:
dict(df['Embarked'].value_counts())['C']/sum(dict(df['Embarked'].value_counts()).values())

#### OR

In [None]:
len(df[df['Embarked'] == 'C'])/len(df['Embarked'].dropna())
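
A shorter route to the same kind of answer is `value_counts(normalize=True)`, which returns fractions instead of counts and leaves out missing values by default. This is a sketch on a made-up Series, so the printed percentage is not the Titanic result.

```python
import pandas as pd

# Toy Embarked values (not the real data); None marks a missing port.
embarked = pd.Series(["C", "S", "S", "Q", None], name="Embarked")

# normalize=True divides each count by the number of non-missing values.
shares = embarked.value_counts(normalize=True)

pct_c = shares["C"] * 100
print(f"{pct_c:.1f}% embarked at C")
```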

What percent of female passengers embarked at C?

In [None]:
pd.crosstab(df['Sex'], df['Embarked'])

In [None]:
len(df[(df['Sex'] == 'female') & (df['Embarked'] == 'C')])

In [None]:
len(df[df['Sex'] == 'female'])

In [None]:
73/ 314

In [None]:
len(df[(df['Sex'] == 'female') & (df['Embarked'] == 'C')])/len(df[df['Sex'] == 'female'])

This question is different from the one above:
What percent of the passengers who embarked at C were female?
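
The two questions share a numerator (females who embarked at C) but divide by different totals. The sketch below makes that explicit on a small made-up table; the data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy rows (not the real Titanic data); None marks a missing port.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "female", "male", "male", "female", "male"],
    "Embarked": ["C", "C", "C", "S", "S", "S", "Q", None],
})

is_female = toy["Sex"] == "female"
at_c = toy["Embarked"] == "C"
both = (is_female & at_c).sum()

# Of all female passengers, what share embarked at C? (divide by the females)
share_of_females_at_c = both / is_female.sum()

# Of all passengers who embarked at C, what share were female? (divide by the C passengers)
share_of_c_female = both / at_c.sum()

print(share_of_females_at_c, share_of_c_female)
```

On the real `df`, the second ratio corresponds to `len(df[(df['Sex'] == 'female') & (df['Embarked'] == 'C')]) / len(df[df['Embarked'] == 'C'])`.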