# BIG DIVE Intesa 3
## Data Science and Scikit-learn
by Stefania Delprete, TOP-IX  
stefania.delprete@top-ix.org 

https://www.linkedin.com/in/astrastefania   
https://twitter.com/astrastefania  

---

## Exploring a big Effective Altruism survey

Effective Altruism https://www.effectivealtruism.com

In [1]:
import pandas as pd

In [2]:
survey = pd.read_csv('../../BDINTESA3/Data_Science/data/EA_imsurvey2017-anonymized-currencied.csv')

In [3]:
survey.shape

(2521, 218)

In [4]:
survey.head(5)

Unnamed: 0,id,referrer_url,heard_ea,is_ea,is_ea_comment,sincere,cause_import_animal_welfare,cause_import_cause_prioritization,cause_import_environmentalism,cause_import_ai,...,donate_sp_2016_c,donate_thl_2015_c,donate_thl_2016_c,donate_tlycs_2015_c,donate_tlycs_2016_c,plan_donate_how_much_c,income_2015_household_c,income_2015_individual_c,income_2016_household_c,income_2016_individual_c
0,,,,,,,,,,,...,,,,,,,,,,
1,"\t """,EAs_sometimes_think_some_causes_are_more_impor...,EAs_sometimes_think_some_causes_are_more_impor...,EAs_sometimes_think_some_causes_are_more_impor...,EAs_sometimes_think_some_causes_are_more_impor...,Comments?,Do_you_identify_with_any_other_social_movement...,Do_you_identify_with_any_other_social_movement...,Do_you_identify_with_any_other_social_movement...,Do_you_identify_with_any_other_social_movement...,...,,,,,,,,,,
2,9,http://survey.effectivealtruismhub.com/index.p...,Yes,Yes,But I am concerned EA looks cultish from the o...,Yes (pick this option to have your answers cou...,This cause should be the top priority,This cause should be a near-top priority,"I do not think this is a priority, but I am gl...","I do not think this is a priority, but I am gl...",...,,,,,,,15908.37,15908.37,15908.37,15908.37
3,10,http://survey.effectivealtruismhub.com/index.p...,Yes,Yes,,Yes (pick this option to have your answers cou...,This cause should be a near-top priority,This cause should be a near-top priority,This cause deserves significant investment but...,This cause should be the top priority,...,,,,,,1.5,0.0,0.0,0.0,0.0
4,11,http://survey.effectivealtruismhub.com/index.p...,Yes,Yes,,Yes (pick this option to have your answers cou...,,,,,...,,,,,,,,,,


In [6]:
# A lot of variables/columns!
survey.columns

Index(['id', 'referrer_url', 'heard_ea', 'is_ea', 'is_ea_comment', 'sincere',
       'cause_import_animal_welfare', 'cause_import_cause_prioritization',
       'cause_import_environmentalism', 'cause_import_ai',
       ...
       'donate_sp_2016_c', 'donate_thl_2015_c', 'donate_thl_2016_c',
       'donate_tlycs_2015_c', 'donate_tlycs_2016_c', 'plan_donate_how_much_c',
       'income_2015_household_c', 'income_2015_individual_c',
       'income_2016_household_c', 'income_2016_individual_c'],
      dtype='object', length=218)

In [7]:
# Let's choose 6 columns and remove the first two rows (an empty row and the question text)
survey_ = survey.loc[2:,['id', 'is_ea', 'student', 'employment_status', 'field', 'education',]]
survey_.head()

Unnamed: 0,id,is_ea,student,employment_status,field,education
2,9,Yes,Yes,,,Master’s degree
3,10,Yes,Yes,,,Undergraduate degree (bachelor’s)
4,11,Yes,,,,
5,12,,,,,
6,13,Yes,,,,


In [8]:
survey_.shape

(2519, 6)
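The selection above relies on `.loc`, which slices by index label and includes both ends of the range. A minimal sketch on a made-up four-row frame (the `toy` data and the `extra` column are assumptions for illustration, not part of the survey):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey: two unusable rows, then real answers
toy = pd.DataFrame({
    'id': [np.nan, np.nan, 9, 10],
    'is_ea': [np.nan, 'question text', 'Yes', 'Yes'],
    'extra': [0, 1, 2, 3],
})

# Rows from label 2 onwards, and only the listed columns
subset = toy.loc[2:, ['id', 'is_ea']]

# .loc includes the end label, .iloc excludes the end position
by_label = toy.loc[1:2]      # labels 1 and 2
by_position = toy.iloc[1:2]  # position 1 only

print(subset.shape, len(by_label), len(by_position))
```

The original index labels are kept, so the first row of `subset` still has label 2.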

### Exploring values

In [None]:
survey_['is_ea'].count()

In [None]:
survey_['is_ea'].unique() # .unique() shows the unique values

In [None]:
survey_['employment_status'].unique() 

In [None]:
survey_['employment_status'].nunique() # .nunique() shows the number of unique values

In [None]:
survey_['field'].unique(), survey_['field'].nunique()
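These three methods treat null values differently, which matters in a survey with many unanswered questions. A small sketch on a made-up Series (the `answers` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical answers standing in for one survey column
answers = pd.Series(['Yes', 'No', np.nan, 'Yes', np.nan])

non_null = answers.count()                           # non-null entries only
distinct = answers.unique()                          # distinct values, NaN included
n_distinct = answers.nunique()                       # NaN excluded by default
n_distinct_with_nan = answers.nunique(dropna=False)  # NaN counted as a value

print(non_null, list(distinct), n_distinct, n_distinct_with_nan)
```

So the length of `.unique()` and the result of `.nunique()` can differ by one when the column contains nulls.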

### Counting values 
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html

In [None]:
survey_['field'].value_counts() # .value_counts() counts the number of observations for each unique value

In [None]:
survey_['employment_status'].value_counts(dropna=False) # The argument dropna=False shows us the null values too
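As a quick illustration of these options, here is a sketch on a made-up Series (the `status` data is an assumption, not taken from the survey). It also uses `normalize=True`, which returns shares instead of counts:

```python
import numpy as np
import pandas as pd

# Hypothetical employment answers with one missing value
status = pd.Series(['Employed', 'Student', 'Employed', np.nan, 'Employed'])

counts = status.value_counts()                   # NaN is dropped by default
counts_all = status.value_counts(dropna=False)   # NaN gets its own row
shares = status.value_counts(normalize=True)     # fractions of the non-null answers

print(counts)
print(counts_all)
print(shares)
```

Comparing `counts.sum()` with `counts_all.sum()` is a quick way to see how many respondents skipped the question.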

## Drop null values

There are several ways to handle null values in a pandas DataFrame.

In [None]:
survey_['employment_status'].head(15)

In [None]:
value_index3 = survey_['employment_status'].loc[3] # look up the row whose index label is 3

In [None]:
value_index3, type(value_index3)

We can use `.dropna()` to drop all the rows that contain null values.  
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

In [None]:
survey_['employment_status'].dropna().head(15)

Or we can replace the null values with a custom value using `.fillna()`.  
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.fillna.html

In [None]:
survey_['employment_status'].fillna('no answer').head(15)
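Both methods also work on a whole DataFrame, and neither changes the original object unless the result is assigned back. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-survey with gaps in two columns
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'employment_status': ['Employed', np.nan, 'Student', np.nan],
    'field': ['Finance', 'Other', np.nan, np.nan],
})

# Drop every row that has a null in any column
complete_rows = toy.dropna()

# Drop rows only when 'employment_status' is null
status_known = toy.dropna(subset=['employment_status'])

# Fill nulls in one column only, leaving 'field' untouched
filled = toy.fillna({'employment_status': 'no answer'})

print(len(complete_rows), len(status_known))
print(filled)
```

The `subset` argument is often the safer choice on a wide table like this survey, where almost every row has a null somewhere.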

## Grouping by a column/variable with `.groupby()`

In [None]:
field = survey.groupby(['field']).mean(numeric_only=True) # 'field' becomes the index; numeric_only=True skips the text columns

In [None]:
field.head()
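To see what the grouped mean does, here is a sketch on a made-up frame (the `toy` data and its values are assumptions; only the column names mirror the survey):

```python
import pandas as pd

# Hypothetical donations for four respondents in two fields
toy = pd.DataFrame({
    'field': ['Finance', 'Finance', 'Other', 'Other'],
    'employment_status': ['Employed', 'Student', 'Employed', 'Employed'],
    'donate_2016_c': [1000.0, 200.0, 50.0, 150.0],
})

# Mean of every numeric column per field; text columns are skipped
by_field = toy.groupby('field').mean(numeric_only=True)

# Several aggregations on a single column
summary = toy.groupby('field')['donate_2016_c'].agg(['mean', 'count'])

print(by_field)
print(summary)
```

Reporting the count next to the mean helps to spot groups where the average rests on very few respondents.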

## Make a graph to put it all together

In [None]:
import matplotlib.pyplot as plt

In [None]:
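def donation_2015on2016(field):
    """Scatter plot of 2015 vs 2016 donations for one field, one colour per employment status."""
    # Keep only the respondents working in the requested field
    in_field = survey[survey['field'] == field]

    # One group per employment status; rows with a null status are left out by groupby
    groups = in_field.groupby('employment_status')

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.margins(0.05)

    # Draw each group separately so that it gets its own legend entry
    for status, group in groups:
        ax.plot(group['donate_2015_c'], group['donate_2016_c'], marker='.', linestyle='', alpha=0.4, ms=12, label=status)

    ax.legend()

    ax.set_xlabel('Donations made in 2015 (USD)')
    ax.set_ylabel('Donations made in 2016 (USD)')
    ax.set_title('Donations from people working in ' + field)

    plt.show()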

In [None]:
donation_2015on2016('Computers (Practical: IT, programming, etc.)')

In [None]:
donation_2015on2016('Other')

In [None]:
donation_2015on2016('Business (non-EA)')

In [None]:
donation_2015on2016('Finance')

In [None]:
survey[survey['field'] == 'Finance'].loc[:, ['id', 'donate_2015_c', 'donate_2016_c']].head()

---
### `>>> Let's practice` 
Interactive session in groups, going back to the American Time Use Survey.  
Source https://www.ibm.com/communities/analytics/watson-analytics-blog/american-time-use-survey

1. As a group, explore the data and decide on the main hypotheses and insights you want to investigate further
2. Use Matplotlib and Seaborn (and Pandas if necessary) to visualise and represent your hypotheses and insights
3. Choose one particular exploration and insight you want to share with the rest of the class

In [None]:
time_survey = pd.read_csv('data/WA_American-Time-Use-Survey-lite.csv')

In [None]:
# 1  
# Use Pandas to decide what to explore (you can combine your individual work from last week)

In [None]:
# 2
# Start exploring the data with visualisations

In [None]:
# 3
# Choose one insight to dig deeper and to share later with the rest of the class