# Intro to Data Analysis using Pandas

## Plan

* read in dataset
* getting some immediate information about the data
* subset the data
    * by column
    * by row
    * getting min and max value locations (`.idxmin()`, `.idxmax()`)
* sort the data
* make changes to the data
    * new column names
    * change column values
    * create new columns
* create summary stats
    * for the whole dataset
    * for groups
* dealing with missing values

### Read in a dataset

In [1]:
import pandas as pd

In [2]:
demographic_data = pd.read_csv("data/life_expectancy_and_income.csv")

### Getting some immediate information about the data

* what columns there are
    * `.columns`
* how many rows, columns
    * `.shape`
* what the top of the data looks like
    * `.head()`
* info about missing values
    * `.info()`
* info about the distribution of the data
    * `.describe()`

In [3]:
demographic_data.columns

Index(['country', 'year', 'fertility_rate', 'income_per_person',
       'life_expectancy'],
      dtype='object')

In [4]:
demographic_data.shape

(22080, 5)

In [5]:
demographic_data.head()

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.0,1090,29.4
1,Afghanistan,1901,7.0,1110,29.5
2,Afghanistan,1902,7.0,1120,29.5
3,Afghanistan,1903,7.0,1140,29.6
4,Afghanistan,1904,7.0,1160,29.7


In [7]:
demographic_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22080 entries, 0 to 22079
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   country            22080 non-null  object 
 1   year               22080 non-null  int64  
 2   fertility_rate     22080 non-null  float64
 3   income_per_person  22080 non-null  int64  
 4   life_expectancy    22080 non-null  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 862.6+ KB


In [6]:
# by default describe only gives info about numeric data
demographic_data.describe()

Unnamed: 0,year,fertility_rate,income_per_person,life_expectancy
count,22080.0,22080.0,22080.0,22080.0
mean,1959.5,4.84077,7607.700996,52.567773
std,34.640598,1.916428,13448.4086,16.773059
min,1900.0,1.12,312.0,1.1
25%,1929.75,2.98,1370.0,35.7
50%,1959.5,5.45,2880.0,53.55
75%,1989.25,6.5,7702.5,68.2
max,2019.0,8.87,179000.0,85.1


In [8]:
# include info about all data
demographic_data.describe(include="all")

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
count,22080,22080.0,22080.0,22080.0,22080.0
unique,184,,,,
top,Afghanistan,,,,
freq,120,,,,
mean,,1959.5,4.84077,7607.700996,52.567773
std,,34.640598,1.916428,13448.4086,16.773059
min,,1900.0,1.12,312.0,1.1
25%,,1929.75,2.98,1370.0,35.7
50%,,1959.5,5.45,2880.0,53.55
75%,,1989.25,6.5,7702.5,68.2


### Exercise

For the exercises in this lesson we'll use a dataset about occupational prestige. Here is some information on the variables contained in the data.

* `education` Average years of education of occupational incumbents, years, in 1971.
* `income` Average income of incumbents, dollars, in 1971.
* `women` Percentage of incumbents who are women.
* `prestige` Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
* `census` Canadian Census occupational code.
* `type` Type of occupation. A factor with levels (note: out of order): bc, Blue Collar; prof, Professional, Managerial, and Technical; wc, White Collar.

1. Read in the file prestige_occupation.csv. Save it as an object called occupation_prestige.
2. How many rows and columns are in the data?
3. What is the average (mean) value of the column `prestige`? (Hint: you can use `.describe()` to answer this.)

In [9]:
occupation_prestige = pd.read_csv("data/prestige_occupation.csv")

In [10]:
occupation_prestige.shape

(102, 7)

In [11]:
occupation_prestige.describe()

Unnamed: 0,education,income,women,prestige,census
count,102.0,102.0,97.0,102.0,102.0
mean,10.738039,6797.901961,30.472784,46.833333,5401.77451
std,2.728444,4245.922227,31.826063,17.204486,2644.993215
min,6.38,611.0,0.52,14.8,1113.0
25%,8.445,4106.0,4.14,35.225,3120.5
50%,10.54,5930.5,15.51,43.6,5135.0
75%,12.6475,8187.25,54.77,59.275,8312.5
max,15.97,25879.0,97.51,87.2,9517.0


In [12]:
occupation_prestige.head(10)

Unnamed: 0,job,education,income,women,prestige,census,type
0,gov.administrators,13.11,12351,11.16,68.8,1113,prof
1,general.managers,12.26,25879,4.02,69.1,1130,prof
2,accountants,12.77,9271,15.7,63.4,1171,prof
3,purchasing.officers,11.42,8865,9.11,56.8,1175,prof
4,chemists,14.62,8403,11.68,73.5,2111,prof
5,physicists,15.64,11030,5.13,77.6,2113,prof
6,biologists,15.09,8258,25.65,72.6,2133,prof
7,architects,15.44,14163,2.69,78.1,2141,prof
8,civil.engineers,14.52,11377,1.03,73.1,2143,prof
9,mining.engineers,14.64,11023,0.94,68.8,2153,prof


In [13]:
occupation_prestige.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   job        102 non-null    object 
 1   education  102 non-null    float64
 2   income     102 non-null    int64  
 3   women      97 non-null     float64
 4   prestige   102 non-null    float64
 5   census     102 non-null    int64  
 6   type       98 non-null     object 
dtypes: float64(3), int64(2), object(2)
memory usage: 5.7+ KB


## Subsetting the data

* by attribute (`df.column_name`)
* using square brackets (`df["column_name"]`, or a list of names for several columns)

In [14]:
demographic_data.country

0        Afghanistan
1        Afghanistan
2        Afghanistan
3        Afghanistan
4        Afghanistan
            ...     
22075       Zimbabwe
22076       Zimbabwe
22077       Zimbabwe
22078       Zimbabwe
22079       Zimbabwe
Name: country, Length: 22080, dtype: object

In [15]:
demographic_data[["country", "year"]]

Unnamed: 0,country,year
0,Afghanistan,1900
1,Afghanistan,1901
2,Afghanistan,1902
3,Afghanistan,1903
4,Afghanistan,1904
...,...,...
22075,Zimbabwe,2015
22076,Zimbabwe,2016
22077,Zimbabwe,2017
22078,Zimbabwe,2018


One column of data is a pandas Series.
<br>
Multiple columns (i.e. a table) are a pandas DataFrame.
<br>
Another (and preferred) way of filtering subsets is `.loc`:
<br>
```python
df.loc[rows, columns]
```
<br>
For example, all the rows, but only selected columns:

In [21]:
demographic_data.loc[:, ["country", "year", "income_per_person"]]

Unnamed: 0,country,year,income_per_person
0,Afghanistan,1900,1090
1,Afghanistan,1901,1110
2,Afghanistan,1902,1120
3,Afghanistan,1903,1140
4,Afghanistan,1904,1160
...,...,...,...
22075,Zimbabwe,2015,2510
22076,Zimbabwe,2016,2490
22077,Zimbabwe,2017,2570
22078,Zimbabwe,2018,2620


In [22]:
# get rows labelled 0 to 20 (.loc slices include the end label, so this is 21 rows)
demographic_data.loc[0:20, ["country", "year", "income_per_person"]]

Unnamed: 0,country,year,income_per_person
0,Afghanistan,1900,1090
1,Afghanistan,1901,1110
2,Afghanistan,1902,1120
3,Afghanistan,1903,1140
4,Afghanistan,1904,1160
5,Afghanistan,1905,1180
6,Afghanistan,1906,1200
7,Afghanistan,1907,1220
8,Afghanistan,1908,1240
9,Afghanistan,1909,1260
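Slicing with `.loc` works on index labels and includes the end label, which differs from ordinary Python slicing. The sketch below uses a small made-up table (`toy` is a hypothetical name, not the lesson's dataset) and contrasts `.loc` with the position-based `.iloc`:

```python
import pandas as pd

# Hypothetical miniature table, not the lesson's dataset
toy = pd.DataFrame({"year": [1900, 1901, 1902, 1903, 1904]})

by_label = toy.loc[0:2]      # labels 0, 1 and 2: the end label is included
by_position = toy.iloc[0:2]  # positions 0 and 1: the end position is excluded

print(len(by_label), len(by_position))
```

With the default `RangeIndex` the labels and positions coincide, so the only visible difference is whether the end point is included.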


### Filtering Data (subsetting rows)

To return only the rows relating to Japan, we first build a series of `True` / `False` values (a mask):

In [23]:
# returns a series as long as the data with `True` and `False`
demographic_data.country == "Japan"

0        False
1        False
2        False
3        False
4        False
         ...  
22075    False
22076    False
22077    False
22078    False
22079    False
Name: country, Length: 22080, dtype: bool

In [26]:
mask = demographic_data.country == "Japan"

# [False, False, False, ..., True, True, True, ..., False,...]

demographic_data.loc[mask]

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
9720,Japan,1900,4.69,1860,38.7
9721,Japan,1901,5.01,1900,38.8
9722,Japan,1902,4.97,1780,39.0
9723,Japan,1903,4.83,1880,39.1
9724,Japan,1904,4.61,1870,39.2
...,...,...,...,...,...
9835,Japan,2015,1.44,37800,84.1
9836,Japan,2016,1.46,38100,84.2
9837,Japan,2017,1.47,38900,84.2
9838,Japan,2018,1.48,39300,84.4


In [28]:
# combining row filtering and subsetting
demographic_data.loc[mask, ["country", "year", "life_expectancy"]]

Unnamed: 0,country,year,life_expectancy
9720,Japan,1900,38.7
9721,Japan,1901,38.8
9722,Japan,1902,39.0
9723,Japan,1903,39.1
9724,Japan,1904,39.2
...,...,...,...
9835,Japan,2015,84.1
9836,Japan,2016,84.2
9837,Japan,2017,84.2
9838,Japan,2018,84.4


### Multiple Conditions in mask

In [29]:
mask = (demographic_data.country == "Japan") & (demographic_data.year > 2000)

demographic_data.loc[mask]

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
9821,Japan,2001,1.31,33900,81.7
9822,Japan,2002,1.3,33900,82.0
9823,Japan,2003,1.3,34300,82.1
9824,Japan,2004,1.3,35100,82.3
9825,Japan,2005,1.31,35700,82.3
9826,Japan,2006,1.32,36100,82.6
9827,Japan,2007,1.33,36700,82.8
9828,Japan,2008,1.34,36300,82.9
9829,Japan,2009,1.36,34300,83.1
9830,Japan,2010,1.37,35800,83.1


In [32]:
mask_verbose = (demographic_data.country == "Japan") | (demographic_data.country == "Italy")
demographic_data.loc[mask_verbose]

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
9480,Italy,1900,4.53,3780,41.7
9481,Italy,1901,4.49,3840,43.5
9482,Italy,1902,4.46,3900,43.0
9483,Italy,1903,4.43,3940,43.1
9484,Italy,1904,4.44,4010,44.4
...,...,...,...,...,...
9835,Japan,2015,1.44,37800,84.1
9836,Japan,2016,1.46,38100,84.2
9837,Japan,2017,1.47,38900,84.2
9838,Japan,2018,1.48,39300,84.4


In [33]:
mask_convenient = demographic_data.country.isin(["Japan", "Italy"])

demographic_data.loc[mask_convenient]

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
9480,Italy,1900,4.53,3780,41.7
9481,Italy,1901,4.49,3840,43.5
9482,Italy,1902,4.46,3900,43.0
9483,Italy,1903,4.43,3940,43.1
9484,Italy,1904,4.44,4010,44.4
...,...,...,...,...,...
9835,Japan,2015,1.44,37800,84.1
9836,Japan,2016,1.46,38100,84.2
9837,Japan,2017,1.47,38900,84.2
9838,Japan,2018,1.48,39300,84.4


#### To filter a pandas dataframe

1. build a mask
2. apply that to our data

In [34]:
demographic_data.loc[demographic_data.country.isin(["Japan", "Italy"])]

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
9480,Italy,1900,4.53,3780,41.7
9481,Italy,1901,4.49,3840,43.5
9482,Italy,1902,4.46,3900,43.0
9483,Italy,1903,4.43,3940,43.1
9484,Italy,1904,4.44,4010,44.4
...,...,...,...,...,...
9835,Japan,2015,1.44,37800,84.1
9836,Japan,2016,1.46,38100,84.2
9837,Japan,2017,1.47,38900,84.2
9838,Japan,2018,1.48,39300,84.4


### Minimum and Maximum Row Locations

* `.idxmin()`
* `.idxmax()`

In [35]:
# lowest life expectancy location

i = demographic_data.life_expectancy.idxmin()
i

16338

In [36]:
demographic_data.loc[i]

country              Samoa
year                  1918
fertility_rate        6.98
income_per_person     2050
life_expectancy        1.1
Name: 16338, dtype: object

In [37]:
# highest life expectancy

demographic_data.loc[demographic_data.life_expectancy.idxmax()]

country              Singapore
year                      2019
fertility_rate            1.27
income_per_person        90100
life_expectancy           85.1
Name: 17279, dtype: object

### Saving the Subsets

In [39]:
my_list = [0, 1, 2, 3, 4]
# `=` does not create a whole new object;
# python gives the existing my_list object a second name it can be accessed by
my_list_view = my_list

my_list_view.append(10)

print(my_list)

[0, 1, 2, 3, 4, 10]


In [40]:
my_list_copy = my_list.copy()
my_list_copy.remove(10)
print(my_list)
print(my_list_copy)

[0, 1, 2, 3, 4, 10]
[0, 1, 2, 3, 4]
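The same naming behaviour applies to DataFrames, which is why the subsets in this lesson are saved with `.copy()`. A small sketch, assuming a made-up table called `toy`:

```python
import pandas as pd

# Hypothetical miniature table, not the lesson's dataset
toy = pd.DataFrame({"value": [1, 2, 3]})

same_object = toy          # a second name for the same DataFrame
independent = toy.copy()   # a separate DataFrame with its own data

# adding a column through one name is visible through the other name
toy["doubled"] = toy.value * 2

print(list(same_object.columns))
print(list(independent.columns))
```

The copy keeps only the original column, because it stopped sharing anything with `toy` at the moment it was made.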


Get a table that contains the life expectancy data for Japan after the year 2000

In [45]:
# a neat way to lay out pandas wrangling

japan_data = (demographic_data
              .loc[(demographic_data.country == "Japan") & (demographic_data.year > 2000), ["country", "year", "life_expectancy"]]
              .copy()
             )

In [46]:
japan_data

Unnamed: 0,country,year,life_expectancy
9821,Japan,2001,81.7
9822,Japan,2002,82.0
9823,Japan,2003,82.1
9824,Japan,2004,82.3
9825,Japan,2005,82.3
9826,Japan,2006,82.6
9827,Japan,2007,82.8
9828,Japan,2008,82.9
9829,Japan,2009,83.1
9830,Japan,2010,83.1


Task
<br>
Make a new data frame called `job_incomes` that has just the "job", "type" and "income" columns from `occupation_prestige`.
<br>
Make sure the columns appear in the order we have given.

In [47]:
job_incomes = (
    occupation_prestige
    .loc[:, ["job", "type", "income"]]
    .copy()
)

job_incomes

Unnamed: 0,job,type,income
0,gov.administrators,prof,12351
1,general.managers,prof,25879
2,accountants,prof,9271
3,purchasing.officers,prof,8865
4,chemists,prof,8403
...,...,...,...
97,bus.drivers,bc,5562
98,taxi.drivers,bc,4224
99,longshoremen,bc,4753
100,typesetters,bc,6462


Task - 5 minutes
<br>
Return 2 rows at the same time, with the highest and lowest income!
<br>
Hint: you can pass a list of row labels to `.loc[]`.

In [54]:
job_incomes.loc[[job_incomes.income.idxmax(), job_incomes.income.idxmin()]]

Unnamed: 0,job,type,income
1,general.managers,prof,25879
62,babysitters,,611


### Sorting the Data

`.sort_values()` method

In [56]:
# sort the data by year

demographic_data.sort_values("year")

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.00,1090,29.4
3360,Cameroon,1900,5.54,924,29.7
480,Antigua and Barbuda,1900,4.63,1300,33.8
11280,Lithuania,1900,4.96,2740,41.7
11160,Libya,1900,7.20,2470,34.7
...,...,...,...,...,...
14159,Niger,2019,7.07,954,63.2
14039,Nicaragua,2019,2.12,4620,79.2
13919,New Zealand,2019,1.96,36500,81.9
13679,Nepal,2019,2.02,2880,71.5


In [57]:
# sort the data by year, then country

demographic_data.sort_values(["year", "country"])

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.00,1090,29.4
120,Albania,1900,4.60,1220,35.4
240,Algeria,1900,6.99,1750,30.2
360,Angola,1900,7.00,958,29.0
480,Antigua and Barbuda,1900,4.63,1300,33.8
...,...,...,...,...,...
21599,Venezuela,2019,2.25,9720,75.1
21719,Vietnam,2019,1.94,6970,74.7
21839,Yemen,2019,3.69,2340,68.1
21959,Zambia,2019,4.81,3700,64.0


#### Reverse sorting order

* `ascending=False`

In [58]:
# sort demographic data by descending life_expectancy

demographic_data.sort_values("life_expectancy", ascending=False)

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
17279,Singapore,2019,1.27,90100,85.10
17278,Singapore,2018,1.26,90100,85.00
17277,Singapore,2017,1.25,87800,84.80
17276,Singapore,2016,1.25,84700,84.70
9839,Japan,2019,1.50,39700,84.50
...,...,...,...,...,...
19938,Tonga,1918,6.51,969,5.96
3378,Cameroon,1918,5.54,1030,5.95
13444,Namibia,1904,5.96,1900,5.19
9993,Kazakhstan,1933,5.85,3120,4.07


### Overwriting the data

Perhaps we want to persist our changes.
<br>
`.sort_values()` and other methods have the `inplace=` parameter.
<br>
If we set that to `True` it will change the underlying data.

In [60]:
demographic_data.sort_values(["year", "country"], inplace=True)

demographic_data

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.00,1090,29.4
120,Albania,1900,4.60,1220,35.4
240,Algeria,1900,6.99,1750,30.2
360,Angola,1900,7.00,958,29.0
480,Antigua and Barbuda,1900,4.63,1300,33.8
...,...,...,...,...,...
21599,Venezuela,2019,2.25,9720,75.1
21719,Vietnam,2019,1.94,6970,74.7
21839,Yemen,2019,3.69,2340,68.1
21959,Zambia,2019,4.81,3700,64.0
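An alternative to `inplace=True` is to bind the returned DataFrame back to the same name. This is only a sketch of that option on a made-up table (`toy` is a hypothetical name), not a claim about which style is better for this lesson:

```python
import pandas as pd

# Hypothetical miniature table, not the lesson's dataset
toy = pd.DataFrame({"country": ["Zambia", "Albania"], "year": [1901, 1900]})

# sort_values returns a new, sorted DataFrame; binding it to the old name
# persists the change without inplace=True
toy = toy.sort_values(["year", "country"])

print(toy)
```

As in the output above, the rows keep their original index labels after sorting.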


Our `demographic_data` has no null / NA values.
<br>
When a column that does contain them is sorted, they go to the bottom by default.
<br>
`.tail()` is a useful way to check for them.
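Since the lesson's data has nothing missing, here is a small sketch with a made-up column (`toy` and its values are assumptions) showing where missing values land when sorting, and the `na_position` parameter of `.sort_values()` that moves them:

```python
import pandas as pd

# Hypothetical miniature table with one missing value
toy = pd.DataFrame({"life_expectancy": [70.1, float("nan"), 55.3]})

# missing values are placed last by default, whichever way we sort
ascending = toy.sort_values("life_expectancy")
descending = toy.sort_values("life_expectancy", ascending=False)

# na_position="first" moves them to the top instead
na_first = toy.sort_values("life_expectancy", na_position="first")

print(ascending.tail(1))
print(na_first.head(1))
```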

### Changing our data

1. renaming columns
    * `rename()`:
```python
df.rename(columns={"old value":"new value"})
```

Let's rename `income_per_person` to `gdp`

In [62]:
demographic_data.rename(columns={"income_per_person":"gdp"}, inplace=True)

In [64]:
# can pass a dictionary with multiple k:v pairs
demographic_data.rename(columns={"country":"place", "year":"time"})

Unnamed: 0,place,time,fertility_rate,gdp,life_expectancy
0,Afghanistan,1900,7.00,1090,29.4
120,Albania,1900,4.60,1220,35.4
240,Algeria,1900,6.99,1750,30.2
360,Angola,1900,7.00,958,29.0
480,Antigua and Barbuda,1900,4.63,1300,33.8
...,...,...,...,...,...
21599,Venezuela,2019,2.25,9720,75.1
21719,Vietnam,2019,1.94,6970,74.7
21839,Yemen,2019,3.69,2340,68.1
21959,Zambia,2019,4.81,3700,64.0


### Changing our data

2. Changing columns
    * we can extract a column from a dataframe
    * we can modify the values of that column
    * we can overwrite our dataframe's original column with our modified column
```python
df.column = some_operation(df.column)
```
<br>

Let's quadruple the fertility rate (say it was incorrect)

In [66]:
demographic_data.fertility_rate = demographic_data.fertility_rate * 4

In [67]:
demographic_data

Unnamed: 0,country,year,fertility_rate,gdp,life_expectancy
0,Afghanistan,1900,28.00,1090,29.4
120,Albania,1900,18.40,1220,35.4
240,Algeria,1900,27.96,1750,30.2
360,Angola,1900,28.00,958,29.0
480,Antigua and Barbuda,1900,18.52,1300,33.8
...,...,...,...,...,...
21599,Venezuela,2019,9.00,9720,75.1
21719,Vietnam,2019,7.76,6970,74.7
21839,Yemen,2019,14.76,2340,68.1
21959,Zambia,2019,19.24,3700,64.0


### Creating new columns

2 main methods:

1. add a series
    * **does change the original data**
    
```python
df["new_column"] = some_operation(df.column)
```

2. use the `.assign()` method
    * **does not change the original data**

```python
df.assign(new_column = some_operation(df.column))
```
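To make the difference between the two methods concrete, the sketch below checks the column names of a made-up table (`toy`, `decade` and the values are assumptions for illustration) after each one:

```python
import pandas as pd

# Hypothetical miniature table, not the lesson's dataset
toy = pd.DataFrame({"year": [1900, 1950, 2000]})

# .assign() returns a new DataFrame and leaves toy as it was
with_offset = toy.assign(years_since_1900=toy.year - 1900)
columns_after_assign = list(toy.columns)

# bracket assignment adds the column to toy itself
toy["decade"] = toy.year // 10 * 10
columns_after_brackets = list(toy.columns)

print(columns_after_assign)
print(columns_after_brackets)
```

To keep the result of `.assign()`, it has to be saved to a name, as `with_offset` is here.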

In [68]:
# series method

demographic_data["fertility_rate_updated"] = demographic_data.fertility_rate / 4

In [69]:
# assign method

demographic_data.assign(years_since_1900 = demographic_data.year - 1900)

Unnamed: 0,country,year,fertility_rate,gdp,life_expectancy,fertility_rate_updated,years_since_1900
0,Afghanistan,1900,28.00,1090,29.4,7.00,0
120,Albania,1900,18.40,1220,35.4,4.60,0
240,Algeria,1900,27.96,1750,30.2,6.99,0
360,Angola,1900,28.00,958,29.0,7.00,0
480,Antigua and Barbuda,1900,18.52,1300,33.8,4.63,0
...,...,...,...,...,...,...,...
21599,Venezuela,2019,9.00,9720,75.1,2.25,119
21719,Vietnam,2019,7.76,6970,74.7,1.94,119
21839,Yemen,2019,14.76,2340,68.1,3.69,119
21959,Zambia,2019,19.24,3700,64.0,4.81,119


In [70]:
demo_data_since_1900 = (
    demographic_data
    .assign(years_since_1900 = demographic_data.year - 1900)
    .copy()
)

In [71]:
demo_data_since_1900

Unnamed: 0,country,year,fertility_rate,gdp,life_expectancy,fertility_rate_updated,years_since_1900
0,Afghanistan,1900,28.00,1090,29.4,7.00,0
120,Albania,1900,18.40,1220,35.4,4.60,0
240,Algeria,1900,27.96,1750,30.2,6.99,0
360,Angola,1900,28.00,958,29.0,7.00,0
480,Antigua and Barbuda,1900,18.52,1300,33.8,4.63,0
...,...,...,...,...,...,...,...
21599,Venezuela,2019,9.00,9720,75.1,2.25,119
21719,Vietnam,2019,7.76,6970,74.7,1.94,119
21839,Yemen,2019,14.76,2340,68.1,3.69,119
21959,Zambia,2019,19.24,3700,64.0,4.81,119


In [72]:
demo_data_multiple_operations = (
    demographic_data
    .assign(
        # create new
        years_since_1900 = demographic_data.year - 1900,
        # modify existing
        fertility_rate = round(demographic_data.fertility_rate))
    .copy()
)

In [73]:
demo_data_multiple_operations

Unnamed: 0,country,year,fertility_rate,gdp,life_expectancy,fertility_rate_updated,years_since_1900
0,Afghanistan,1900,28.0,1090,29.4,7.00,0
120,Albania,1900,18.0,1220,35.4,4.60,0
240,Algeria,1900,28.0,1750,30.2,6.99,0
360,Angola,1900,28.0,958,29.0,7.00,0
480,Antigua and Barbuda,1900,19.0,1300,33.8,4.63,0
...,...,...,...,...,...,...,...
21599,Venezuela,2019,9.0,9720,75.1,2.25,119
21719,Vietnam,2019,8.0,6970,74.7,1.94,119
21839,Yemen,2019,15.0,2340,68.1,3.69,119
21959,Zambia,2019,19.0,3700,64.0,4.81,119


In [80]:
occupation_prestige.assign(income_000s = occupation_prestige.income / 1000)

Unnamed: 0,job,education,income,women,prestige,census,type,income_000s
0,gov.administrators,13.11,12351,11.16,68.8,1113,prof,12.351
1,general.managers,12.26,25879,4.02,69.1,1130,prof,25.879
2,accountants,12.77,9271,15.70,63.4,1171,prof,9.271
3,purchasing.officers,11.42,8865,9.11,56.8,1175,prof,8.865
4,chemists,14.62,8403,11.68,73.5,2111,prof,8.403
...,...,...,...,...,...,...,...,...
97,bus.drivers,7.58,5562,9.47,35.9,9171,bc,5.562
98,taxi.drivers,7.93,4224,3.59,25.1,9173,bc,4.224
99,longshoremen,8.37,4753,,26.1,9313,bc,4.753
100,typesetters,10.00,6462,13.58,42.2,9511,bc,6.462


### Summarising the data

1. `describe()` gives us
    * mean
    * min
    * median
    * max
2. `df.column.min()`
3. `df.column.median()`

In [81]:
# get the min fertility rate

demographic_data.fertility_rate.min()

4.48

In [82]:
# get the median life expectancy

demographic_data.life_expectancy.median()

53.55

Get a filtered table showing when life expectancy was lower than the median value

In [83]:
mask = demographic_data.life_expectancy < demographic_data.life_expectancy.median()

demographic_data.loc[mask]

Unnamed: 0,country,year,fertility_rate,gdp,life_expectancy,fertility_rate_updated
0,Afghanistan,1900,28.00,1090,29.4,7.00
120,Albania,1900,18.40,1220,35.4,4.60
240,Algeria,1900,27.96,1750,30.2,6.99
360,Angola,1900,28.00,958,29.0,7.00
480,Antigua and Barbuda,1900,18.52,1300,33.8,4.63
...,...,...,...,...,...,...
3836,Central African Republic,2016,19.48,731,51.7,4.87
11036,Lesotho,2016,12.36,2940,52.5,3.09
3837,Central African Republic,2017,19.20,754,51.9,4.80
3838,Central African Republic,2018,18.88,775,52.4,4.72


### Summarising the data for groups

* we can group by a column in the data
* calculate a summary stat for the groups
<br>
Get a table showing the max life expectancy for each country

In [88]:
(
    demographic_data
    # 1. group by country
    .groupby("country")
    # 2. select column of interest
    .life_expectancy
    # 3. define statistic of interest
    .max()
    # 4. (optional) reset to a dataframe with a meaningful column name
    .reset_index(name="max_life_expectancy")
)

Unnamed: 0,country,max_life_expectancy
0,Afghanistan,64.1
1,Albania,78.5
2,Algeria,78.1
3,Angola,65.0
4,Antigua and Barbuda,77.3
...,...,...
179,Venezuela,75.3
180,Vietnam,74.7
181,Yemen,69.0
182,Zambia,64.0
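If more than one statistic per group is wanted, one option is `.agg()` with named aggregations, where each keyword becomes an output column. A sketch on a made-up table (`toy`, its values and the output column names are assumptions):

```python
import pandas as pd

# Hypothetical miniature table, not the lesson's dataset
toy = pd.DataFrame({
    "country": ["Japan", "Japan", "Italy", "Italy"],
    "life_expectancy": [81.7, 84.4, 79.0, 83.0],
})

summary = (
    toy
    .groupby("country")
    # new_column_name=(column_to_summarise, statistic)
    .agg(
        min_life_expectancy=("life_expectancy", "min"),
        max_life_expectancy=("life_expectancy", "max"),
    )
    .reset_index()
)

print(summary)
```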


In [85]:
# get a mean of each variable for each country

demographic_data.groupby("country").mean()

country,year,fertility_rate,gdp,life_expectancy,fertility_rate_updated
Afghanistan,1959.5,28.325000,1828.216667,41.135583,7.081250
Albania,1959.5,16.092000,3628.833333,56.780833,4.023000
Algeria,1959.5,24.342000,6821.750000,51.734167,6.085500
Angola,1959.5,28.039333,3461.608333,41.305000,7.009833
Antigua and Barbuda,1959.5,14.273000,8393.833333,58.087500,3.568250
...,...,...,...,...,...
Venezuela,1959.5,19.270333,10359.333333,54.945000,4.817583
Vietnam,1959.5,17.504667,1718.200000,50.443333,4.376167
Yemen,1959.5,28.086333,2233.583333,40.059167,7.021583
Zambia,1959.5,26.277000,2112.041667,44.748333,6.569250


Task
<br>
For each question here, return a pandas series.
1. Find the maximum income for each type in the occupation_prestige data.
2. Find the average prestige for each type.
3. Find the lowest percentage of women for each type.

In [89]:
(
    occupation_prestige
    .groupby("type")
    .income
    .max()
)

type
bc       8895
prof    25879
wc       8780
Name: income, dtype: int64

In [90]:
(
    occupation_prestige
    .groupby("type")
    .prestige
    .mean()
)

type
bc      35.527273
prof    67.848387
wc      42.243478
Name: prestige, dtype: float64

In [91]:
(
    occupation_prestige
    .groupby("type")
    .women
    .min()
)

type
bc      0.52
prof    0.58
wc      3.16
Name: women, dtype: float64

### Dealing with missing values

1. replace values with something 'meaningful'
    * `.fillna()`
2. remove NA values
    * `.dropna()`
3. keep them

In [92]:
# check for missing-ness
demographic_data.isna().sum()

country                   0
year                      0
fertility_rate            0
gdp                       0
life_expectancy           0
fertility_rate_updated    0
dtype: int64

In [102]:
# practice data

demo_missing = pd.read_csv("data/life_expectancy_and_income_missing.csv")

In [94]:
demo_missing.isna().sum()

country                0
year                   0
fertility_rate         0
income_per_person     14
life_expectancy      184
dtype: int64

In [95]:
demo_missing.shape

(22080, 5)

In [97]:
# drops every row that has an NA value in any column

demo_missing.dropna().shape

(21882, 5)

In [100]:
# drop only the rows that have an NA value in a specific column

demo_missing.dropna(subset=["life_expectancy"]).isna().sum()

country               0
year                  0
fertility_rate        0
income_per_person    14
life_expectancy       0
dtype: int64

In [104]:
# drop for a specific column, persist with inplace=True

demo_missing.dropna(subset=["life_expectancy"], inplace=True)

In [105]:
# reload data back in

demo_missing = pd.read_csv("data/life_expectancy_and_income_missing.csv")

### Simple imputation

In [106]:
# replace NA values with 0

demo_missing.fillna(value = 0)

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.00,1090.0,29.4
1,Afghanistan,1901,7.00,1110.0,0.0
2,Afghanistan,1902,7.00,1120.0,29.5
3,Afghanistan,1903,7.00,1140.0,29.6
4,Afghanistan,1904,7.00,1160.0,29.7
...,...,...,...,...,...
22075,Zimbabwe,2015,3.84,2510.0,59.6
22076,Zimbabwe,2016,3.76,2490.0,60.5
22077,Zimbabwe,2017,3.68,2570.0,61.4
22078,Zimbabwe,2018,3.61,2620.0,61.7


### More complex imputation

```python
df.fillna(
    value = {
        "column_with_nas": value_to_replace_na_with
    }
)
```

In [107]:
demo_missing.fillna(
    value = {
        "life_expectancy": 0,
        "income_per_person": 50
    }
)

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.00,1090.0,29.4
1,Afghanistan,1901,7.00,1110.0,0.0
2,Afghanistan,1902,7.00,1120.0,29.5
3,Afghanistan,1903,7.00,1140.0,29.6
4,Afghanistan,1904,7.00,1160.0,29.7
...,...,...,...,...,...
22075,Zimbabwe,2015,3.84,2510.0,59.6
22076,Zimbabwe,2016,3.76,2490.0,60.5
22077,Zimbabwe,2017,3.68,2570.0,61.4
22078,Zimbabwe,2018,3.61,2620.0,61.7


In [117]:
# replace missing life expectancy with the mean() and missing income with the median()

demo_missing.fillna(
    value = {
        "life_expectancy": demo_missing.life_expectancy.mean(),
        "income_per_person": demo_missing.income_per_person.median()
    }
)

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.00,1090.0,29.400000
1,Afghanistan,1901,7.00,1110.0,52.725449
2,Afghanistan,1902,7.00,1120.0,29.500000
3,Afghanistan,1903,7.00,1140.0,29.600000
4,Afghanistan,1904,7.00,1160.0,29.700000
...,...,...,...,...,...
22075,Zimbabwe,2015,3.84,2510.0,59.600000
22076,Zimbabwe,2016,3.76,2490.0,60.500000
22077,Zimbabwe,2017,3.68,2570.0,61.400000
22078,Zimbabwe,2018,3.61,2620.0,61.700000


### Complex imputation

In [110]:
# replace missing life expectancy values with the mean() life expectancy for each country

# .transform("mean") returns the country mean repeated for every row,
# so the result is the same length as the data and lines up with its index

mean_per_country = demo_missing.groupby("country").life_expectancy.transform("mean")

demo_missing.fillna(
    value = {
        "life_expectancy": mean_per_country,
        "income_per_person": demo_missing.income_per_person.median()
    }
)

Unnamed: 0,country,year,fertility_rate,income_per_person,life_expectancy
0,Afghanistan,1900,7.00,1090.0,29.400000
1,Afghanistan,1901,7.00,1110.0,41.233361
2,Afghanistan,1902,7.00,1120.0,29.500000
3,Afghanistan,1903,7.00,1140.0,29.600000
4,Afghanistan,1904,7.00,1160.0,29.700000
...,...,...,...,...,...
22075,Zimbabwe,2015,3.84,2510.0,59.600000
22076,Zimbabwe,2016,3.76,2490.0,60.500000
22077,Zimbabwe,2017,3.68,2570.0,61.400000
22078,Zimbabwe,2018,3.61,2620.0,61.700000
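The reason for `.transform("mean")` rather than `.mean()` is the shape of the result. A sketch on a made-up table (`toy` and its values are assumptions; the fill here goes through `Series.fillna`, which is a slightly different route from the dictionary used above):

```python
import pandas as pd

# Hypothetical miniature table with missing values
toy = pd.DataFrame({
    "country": ["Japan", "Japan", "Japan", "Italy", "Italy"],
    "life_expectancy": [80.0, float("nan"), 84.0, 78.0, float("nan")],
})

grouped = toy.groupby("country").life_expectancy

per_group = grouped.mean()           # one value per country
per_row = grouped.transform("mean")  # one value per row, aligned to toy's index

# each missing value is filled with the mean of its own country
filled = toy.assign(life_expectancy=toy.life_expectancy.fillna(per_row))

print(len(per_group), len(per_row))
print(filled)
```

Because `per_row` shares the index of `toy`, each gap is matched with the mean for the right country.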


Task
<br>
Find which columns have missing values in occupation_prestige.
<br>
Replace all the missing values in `type` with "other". Check that this change has been made in `occupation_prestige`.
* Find the average of women.
    * Without changing or removing missing values.
    * With all the missing values changed to 0.
    * With all the missing values dropped.

In [111]:
occupation_prestige.isna().sum()

job          0
education    0
income       0
women        5
prestige     0
census       0
type         4
dtype: int64

In [118]:
occupation_prestige.fillna(value = {"type":"other"}, inplace=True)

In [119]:
occupation_prestige.isna().sum()

job          0
education    0
income       0
women        5
prestige     0
census       0
type         0
dtype: int64

In [114]:
occupation_prestige.women.mean()

30.47278350515466

In [115]:
fill_0 = occupation_prestige.fillna(0)

fill_0.women.mean()

28.979019607843156

In [116]:
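# dropna() with no `subset` drops rows with an NA in any column, not just `women`;
# use subset=["women"] to drop only on that column
drop_na = occupation_prestige.dropna()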

drop_na.women.mean()

30.544086021505397

In [121]:
(
    occupation_prestige
    .fillna(value = {
        "women": 0
    })
    .loc[(occupation_prestige.type != "other"), ["job", "income", "women", "type"]]
    .assign(income_1000s = occupation_prestige.income / 1000)
    .groupby("type")
    .income_1000s
    .mean()
)

type
bc       5.374136
prof    10.559452
wc       5.052304
Name: income_1000s, dtype: float64
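The chain above refers back to `occupation_prestige` inside `.loc` and `.assign()`, which works because pandas lines the rows up by index. One alternative, sketched here on a made-up table (`toy` and its rows are assumptions for illustration), is to pass a `lambda` so each step works on the frame as it stands at that point in the chain:

```python
import pandas as pd

# Hypothetical miniature table, loosely shaped like occupation_prestige
toy = pd.DataFrame({
    "job": ["chemists", "bus.drivers", "babysitters", "typesetters"],
    "type": ["prof", "bc", "other", "bc"],
    "income": [8403, 5562, 611, 6462],
})

mean_income_1000s = (
    toy
    # each lambda receives the frame produced by the previous step
    .loc[lambda df: df.type != "other", ["job", "income", "type"]]
    .assign(income_1000s=lambda df: df.income / 1000)
    .groupby("type")
    .income_1000s
    .mean()
)

print(mean_income_1000s)
```

This keeps the chain self-contained, so it still works if an earlier step renames or filters the data.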