<a href="https://colab.research.google.com/github/kKravtsova/data_and_python/blob/main/Describing_data_kkravtsova.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Describing data with summary statistics
---

This worksheet follows the pandas Getting Started tutorials, selecting particular tutorials and linking them into a single theme.

We will focus on describing data. Description carries the least risk of bias and inaccurate conclusions, because it reports only on the data presented to us.

Each exercise will ask you to work through one tutorial on the Getting Started page, try the code from the tutorial here, and then try a second, similar action.

---

The practice data from the tutorials comes from a dataset on Titanic passengers.


### Exercise 1 - open the Titanic dataset
---

The Titanic dataset is stored at this URL:
https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv

Read the dataset into a pandas dataframe that you will call **titanic**.

**Test output**:  
The shape of the dataframe will be (891, 12)

In [1]:
import pandas as pd

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
titanic = pd.read_csv(url)
titanic.shape

(891, 12)
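
The `shape` attribute is a `(rows, columns)` tuple. As a minimal sketch of how `read_csv` and `shape` fit together, the example below reads a tiny made-up CSV from memory instead of a URL (the `csv_text` contents and the `toy` name are assumptions for illustration, not part of the Titanic data):

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the empty Age field on the second row becomes NaN
csv_text = "PassengerId,Name,Age\n1,Ann,31\n2,Bob,\n3,Cy,45\n"

# read_csv accepts a file-like object as well as a URL or file path
toy = pd.read_csv(io.StringIO(csv_text))

# shape is (number of rows, number of columns)
rows, columns = toy.shape
print(rows, columns)
```

The same call pattern applies when the first argument is a URL, as in the cell above.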

### Exercise 2 - get summary information about the dataframe
---

Read through the tutorials:  
[What kind of data does pandas handle?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#)  
[How do I read and write tabular data?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html)

Use pandas functions to display the following:
1.  A technical summary of the data (`info()`)
2.  A description of the numerical data (`describe()`)
3.  The Series 'Age'

**Test output**:   
1.  The info should show that there are only 204 values in the Cabin series, out of 891 records.  
2.  The description should show 7 columns and a mean age of 29.699118
3.  The Age series should have values of type float64 and Length 891

In [2]:
# info is a method, so it needs parentheses to run
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
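
Because `info` is a method, writing it without parentheses only refers to the method and shows its representation; writing `info()` actually runs it. A minimal sketch of the difference, using a small made-up dataframe (the `toy` data and the variable names are assumptions for illustration):

```python
import io

import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [31.0, None, 45.0]})

# Without parentheses this is only a reference to the method; nothing is summarised
method_reference = toy.info

# With parentheses the method runs; buf= captures the printed summary as text
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The captured summary lists each column with its non-null count and dtype, which is the information Exercise 2 asks for.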

In [3]:
titanic.describe()

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200


In [4]:
# Display the Series 'Age'
age = titanic["Age"]
age

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

### Exercise 3 - aggregating statistics
---

Read through the tutorial:  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)  

Use pandas functions to display the following summary statistics from the Titanic dataset:  

1.  The average (mean) age of passengers  
2.  The median age and fare  
3.  The mean fare
4.  The modal fare and gender

**Test output**:   
29.699118, Age 28.0000 Fare 14.4542, 32.2042079685746, Fare 8.05 Sex male


In [5]:
# The average (mean) age of passengers
age.mean()

29.69911764705882

In [6]:
# The median age and fare
titanic[["Age", "Fare"]].median()

Age     28.0000
Fare    14.4542
dtype: float64

In [7]:
# The mean fare
fare = titanic["Fare"]
fare.mean()

32.204207968574636

In [8]:
# The modal fare and gender
titanic[["Fare", "Sex"]].mode()

   Fare   Sex
0  8.05  male
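
`mode()` returns a table (or a Series) rather than a single value because there can be ties for the most frequent value. A minimal sketch with made-up fares (the values and variable names are assumptions for illustration):

```python
import pandas as pd

# Illustrative fares in which 8.05 and 13.0 both occur twice
fares = pd.Series([7.25, 8.05, 8.05, 13.0, 13.0])

# mode() returns every value tied for most frequent, in sorted order
modes = fares.mode()
print(modes.tolist())

# To take a single value, pick the first entry explicitly
single_mode = modes.iloc[0]
print(single_mode)
```

In the Titanic data there is a single modal fare and a single modal sex, so the result has just one row, labelled 0.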


### Exercise 4 - displaying other statistics
---

Take a look at the list of methods available for giving summary statistics [here](https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats)

Use pandas functions, and your existing knowledge, to display the following summary statistics from the Titanic dataset:

1.  The total number of passengers on the Titanic
2.  The age of the youngest passenger
3.  The most expensive ticket price
4.  The range of ticket prices
5.  The number of passengers with cabins
6.  The code for the port where the highest number of passengers embarked
7.  The most populous gender
8.  The standard deviation for age and fare

**Tests**:  
891, 0.42, 512.3292, 512.3292, 204, S, male, Age 14.526497 Fare 49.693429

In [9]:
# The total number of passengers on the titanic
titanic["PassengerId"].count()

891

In [10]:
# The age of the youngest passenger
age.min()

0.42

In [11]:
# The most expensive ticket price
fare_max = fare.max()
fare_max

512.3292

In [12]:
# The range of ticket prices
fare_min = fare.min()
fare_max - fare_min

512.3292

In [13]:
# The number of passengers with cabins (non-missing values in the Cabin column)
cabin_with = titanic["Cabin"].notna().sum()
cabin_with

204

In [14]:
# The code for the port where the highest number of passengers embarked
titanic["Embarked"].mode()

0    S
Name: Embarked, dtype: object

In [15]:
# The most populous gender
titanic["Sex"].mode()

0    male
Name: Sex, dtype: object

In [16]:
# The standard deviation for age and fare
titanic[["Age", "Fare"]].std()

Age     14.526497
Fare    49.693429
dtype: float64

### Exercise 5 - aggregating statistics grouped by category
---

Refer again to the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)   
looking particularly at the section on Aggregating statistics grouped by category.

1.  What is the mean age for male versus female Titanic passengers?
2.  What is the mean ticket fare price for each of the sex and cabin class combinations?
3.  What is the mean ticket fare price for passengers who embarked at each port?
4.  Which passenger class had the highest number of survivors (for now, just show the statistics - it may not be meaningful yet)?

**Test output**:  
1.  female 27.915709 male 30.726645
2.  
```
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
```
3.  
```
C    59.954144
Q    13.276030
S    27.079812
```
4.

```
Survived  Pclass

0         1          80
          2          97
          3         372
1         1         136
          2          87
          3         119
```






In [29]:
# What is the mean age for male versus female Titanic passengers?
titanic[["Sex", "Age"]].groupby("Sex").mean()

              Age
Sex
female  27.915709
male    30.726645


In [39]:
# What is the mean ticket fare price for each of the sex and cabin class combinations?
titanic[["Sex", "Pclass", "Fare"]].groupby(["Sex", "Pclass"]).mean()

                     Fare
Sex    Pclass
female 1       106.125798
       2        21.970121
       3        16.118810
male   1        67.226127
       2        19.741782
       3        12.661633


In [44]:
# What is the mean ticket fare price for passengers who embarked at each port?
titanic[["Fare", "Embarked"]].groupby("Embarked").mean()

               Fare
Embarked
C         59.954144
Q         13.276030
S         27.079812


In [46]:
# Which passenger class had the highest number of survivors (for now, just show the statistics - it may not be meaningful yet)?
titanic.groupby(["Survived", "Pclass"])["PassengerId"].count()

Survived  Pclass
0         1          80
          2          97
          3         372
1         1         136
          2          87
          3         119
Name: PassengerId, dtype: int64
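
The grouped counts above are indexed by two levels, which can be hard to scan. One possible way to read them more easily is to pivot the inner level into columns with `unstack()`. A minimal sketch on made-up data (the `toy` dataframe and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical passengers: 1 = survived, 0 = did not survive
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 1, 0],
    "Pclass":   [3, 3, 1, 1, 2, 1],
})

# size() counts the rows in each (Survived, Pclass) combination
counts = toy.groupby(["Survived", "Pclass"]).size()

# unstack() moves Pclass into columns; missing combinations become 0
table = counts.unstack(fill_value=0)
print(table)

# The row labelled 1 holds the survivor counts; idxmax() names the largest
survivors_by_class = table.loc[1]
top_class = survivors_by_class.idxmax()
print(top_class)
```

As the exercise notes, a raw count like this may not be meaningful on its own, because the classes had different numbers of passengers.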

### Exercise 6 - an aggregation of different statistics
---

Use the `titanic.agg()` method, as shown in the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)  

1.  Display the aggregations specified by this dictionary:

```
     {
         "Age": ["min", "max", "median", "skew"],
         "Fare": ["min", "max", "median", "mean"]
     }
```
2.  Display:  
min, max and mean for Age  
min, max and standard deviation for Fare  
count for Cabin

**Test output**:   
1.
```
              Age        Fare
max     80.000000  512.329200
mean          NaN   32.204208
median  28.000000   14.454200
min      0.420000    0.000000
skew     0.389108         NaN
```

2.   
```
             Age        Fare  Cabin
count        NaN         NaN  204.0
max    80.000000  512.329200    NaN
mean   29.699118         NaN    NaN
min     0.420000    0.000000    NaN
std          NaN   49.693429    NaN
```




In [42]:
# 1
titanic.agg({
        "Age": ["min", "max", "median", "skew"],
        "Fare": ["min", "max", "median", "mean"],
    })

              Age        Fare
min      0.420000    0.000000
max     80.000000  512.329200
median  28.000000   14.454200
skew     0.389108         NaN
mean          NaN   32.204208


In [43]:
# 2
titanic.agg({
        "Age": ["min", "max", "mean"],
        "Fare": ["min", "max", "std"],
        "Cabin": "count",
    })

             Age        Fare  Cabin
min     0.420000    0.000000    NaN
max    80.000000  512.329200    NaN
mean   29.699118         NaN    NaN
std          NaN   49.693429    NaN
count        NaN         NaN  204.0
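
The `NaN` entries in these results are not missing data in the dataset. When `agg()` is given a dictionary of lists, the result has one row for every statistic requested anywhere, and a cell is `NaN` when that statistic was not requested for that column. A minimal sketch on made-up data (the `toy` dataframe is an assumption for illustration):

```python
import pandas as pd

# Hypothetical data with no missing values at all
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})

# "mean" is requested only for Age and "max" only for Fare
result = toy.agg({"Age": ["min", "mean"], "Fare": ["min", "max"]})
print(result)

# Cells for statistics that were not requested are NaN
print(pd.isna(result.loc["max", "Age"]), pd.isna(result.loc["mean", "Fare"]))
```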


### Exercise 7 - count by category
---

Read the section Count number of records by category in the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)

1. Display the number of passengers of each gender who had a ticket
2. Display the number of passengers who embarked at each port and had a ticket
3. Calculate the percentage of PassengerIds who survived the sinking of the Titanic (*Hint:  try getting the PassengerIds with a count for survived or not.  Store this value in a new variable, which will contain a list/array.  The second item in this list will be the number who survived.  You can use this number and the count of PassengerIds to calculate the percentage*)

**Test output**:  
1.  female 314, male 577
2.  C 168, Q 77, S 644
3.  38.38383838383838



In [47]:
# Display the number of passengers of each gender who had a ticket
titanic["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [49]:
# Display the number of passengers who embarked at each port and had a ticket
titanic["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [54]:
# Calculate the percentage of PassengerIds who survived the sinking of the Titanic
# (Hint: try getting the PassengerIds with a count for survived or not. Store this value in a new variable, which will contain a list/array.
# The second item in this list will be the number who survived. You can use this number and the count of PassengerIds to calculate the percentage)
titanic["Survived"].value_counts()[1] / len(titanic["Survived"]) * 100

38.38383838383838
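
Because `Survived` holds only 0 and 1, its mean is the proportion of survivors, which offers a shorter route to the same percentage. A minimal sketch comparing three equivalent calculations on made-up data (the `survived` values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical outcomes: 3 survivors out of 8 passengers
survived = pd.Series([0, 1, 1, 0, 0, 1, 0, 0], name="Survived")

# Approach used above: look up the count labelled 1 and divide by the total
pct_from_counts = survived.value_counts().loc[1] / len(survived) * 100

# The mean of a 0/1 column is the proportion of 1s
pct_from_mean = survived.mean() * 100

# normalize=True makes value_counts return proportions instead of counts
pct_normalized = survived.value_counts(normalize=True).loc[1] * 100

print(pct_from_counts, pct_from_mean, pct_normalized)
```

Using `.loc[1]` makes it explicit that the lookup is by the label 1 (survived), not by position in the result.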

### Exercise 8 - summary happiness statistics
---

Open the data set here: https://github.com/futureCodersSE/working-with-data/blob/main/Happiness-Data/2019.xlsx?raw=true

It contains data on people's perception of happiness levels in a number of countries across the world.

1.  Display the number of records in the set  
2.  Display the description of the numerical data  
3.  Display the highest GDP and life expectancy  
4.  Display the mean, max and min for Freedom; the mean, max, min and skew for Generosity; and the mean, max, min and std for GDP  

**Test output**:  
1.  156
2.  Table showing count, mean, std, min, 25%, 50%, 75%, max for 8 columns
3.  GDP 1.684, life expectancy 1.141  
4.  


```
      Freedom to make life choices  Generosity  GDP per capita
max                       0.631000    0.566000        1.684000
mean                      0.392571    0.184846        0.905147
min                       0.000000    0.000000        0.000000
skew                           NaN    0.745942             NaN
std                            NaN         NaN        0.398389
```




In [62]:
# Load dataset
happy_url = 'https://github.com/futureCodersSE/working-with-data/blob/main/Happiness-Data/2019.xlsx?raw=true'
happy_df = pd.read_excel(happy_url)
happy_df

Unnamed: 0,Overall rank,Country or region,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
0,1,Finland,7.769,1.340,1.587,0.986,0.596,0.153,0.393
1,2,Denmark,7.600,1.383,1.573,0.996,0.592,0.252,0.410
2,3,Norway,7.554,1.488,1.582,1.028,0.603,0.271,0.341
3,4,Iceland,7.494,1.380,1.624,1.026,0.591,0.354,0.118
4,5,Netherlands,7.488,1.396,1.522,0.999,0.557,0.322,0.298
...,...,...,...,...,...,...,...,...,...
151,152,Rwanda,3.334,0.359,0.711,0.614,0.555,0.217,0.411
152,153,Tanzania,3.231,0.476,0.885,0.499,0.417,0.276,0.147
153,154,Afghanistan,3.203,0.350,0.517,0.361,0.000,0.158,0.025
154,155,Central African Republic,3.083,0.026,0.000,0.105,0.225,0.235,0.035


In [63]:
# Display the number of records in the set
len(happy_df)

156

In [65]:
# Display the description of the numerical data
happy_df.describe()

Unnamed: 0,Overall rank,Score,GDP per capita,Social support,Healthy life expectancy,Freedom to make life choices,Generosity,Perceptions of corruption
count,156.0,156.0,156.0,156.0,156.0,156.0,156.0,156.0
mean,78.5,5.407096,0.905147,1.208814,0.725244,0.392571,0.184846,0.110603
std,45.177428,1.11312,0.398389,0.299191,0.242124,0.143289,0.095254,0.094538
min,1.0,2.853,0.0,0.0,0.0,0.0,0.0,0.0
25%,39.75,4.5445,0.60275,1.05575,0.54775,0.308,0.10875,0.047
50%,78.5,5.3795,0.96,1.2715,0.789,0.417,0.1775,0.0855
75%,117.25,6.1845,1.2325,1.4525,0.88175,0.50725,0.24825,0.14125
max,156.0,7.769,1.684,1.624,1.141,0.631,0.566,0.453


In [68]:
# Display the highest GDP and life expectancy
happy_df[["GDP per capita", "Healthy life expectancy"]].max()

GDP per capita             1.684
Healthy life expectancy    1.141
dtype: float64

In [67]:
# Display the mean, max and min for Freedom; the mean, max, min and skew for Generosity; and the mean, max, min and std for GDP
happy_df.agg(
    { "Freedom to make life choices": ["mean", "max", "min"],
      "Generosity": ["mean", "max", "min", "skew"],
      "GDP per capita": ["mean", "max", "min", "std"],
    }
)

      Freedom to make life choices  Generosity  GDP per capita
mean                      0.392571    0.184846        0.905147
max                       0.631000    0.566000        1.684000
min                       0.000000    0.000000        0.000000
skew                           NaN    0.745942             NaN
std                            NaN         NaN        0.398389


### Exercise 9 - migration data
---

Open the dataset at this URL: https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true  
Open the sheet named *Country Migration*.

1.  Describe the dataset  
2.  Show summary information
3.  Display the mean net per 10K migration in each of the years 2015 to 2019
4.  Display the mean, max and min migration for the year 2019 for each of the regions (*base_country_wb_region*)
5.  Display the median net migration for the years 2015 and 2019 for the base countries by income level
6.  Display the number of target countries in each income level
7.  Display the mean net migration for all five years, for each income level

**Test output**:  
1  count, mean, std, min, 25%, 50%, 75%, max for 9 columns

2  shows 17 columns with non-null count of 4148 in each column

3  
```
net_per_10K_2015    0.461757
net_per_10K_2016    0.150248
net_per_10K_2017   -0.080272
net_per_10K_2018   -0.040591
net_per_10K_2019   -0.022743
dtype: float64
```

4  
```
                           net_per_10K_2019
                                       mean    max    min
base_country_wb_region
East Asia & Pacific                0.198827  21.57  -9.88
Europe & Central Asia              0.208974  87.71 -21.34
Latin America & Caribbean         -0.904602  21.15 -31.75
Middle East & North Africa        -0.107655  55.60 -50.33
North America                      0.239246  23.20  -0.29
South Asia                        -0.514577  13.72 -24.89
Sub-Saharan Africa                -0.279729  37.11 -21.54
```

5  
```
                        net_per_10K_2015  net_per_10K_2019
base_country_wb_income
High Income                         0.02              0.04
Low Income                          0.42             -0.05
Lower Middle Income                -0.02             -0.07
Upper Middle Income                -0.03             -0.08
```

6  
```
base_country_wb_income
High Income            2415
Low Income              185
Lower Middle Income     653
Upper Middle Income     895
Name: target_country_name, dtype: int64
```

7  
```
                        net_per_10K_2015  net_per_10K_2016  net_per_10K_2017  net_per_10K_2018  net_per_10K_2019
base_country_wb_income
High Income                     0.505482          0.391379          0.314178          0.379201          0.401470
Low Income                      1.876432          0.798270         -0.684865         -0.677784         -0.681459
Lower Middle Income             0.591654         -0.029893         -0.519433         -0.527136         -0.476616
Upper Middle Income            -0.043419         -0.502916         -0.699240         -0.686626         -0.700101
```






In [74]:
# Load Country Migration dataset
CM_url = "https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true"
CM_df = pd.read_excel(CM_url, sheet_name="Country Migration")
CM_df

Unnamed: 0,base_country_code,base_country_name,base_lat,base_long,base_country_wb_income,base_country_wb_region,target_country_code,target_country_name,target_lat,target_long,target_country_wb_income,target_country_wb_region,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,af,Afghanistan,33.939110,67.709953,Low Income,South Asia,0.19,0.16,0.11,-0.05,-0.02
1,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,dz,Algeria,28.033886,1.659626,Upper Middle Income,Middle East & North Africa,0.19,0.25,0.57,0.55,0.78
2,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,ao,Angola,-11.202692,17.873887,Lower Middle Income,Sub-Saharan Africa,-0.01,0.04,0.11,-0.02,-0.06
3,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,ar,Argentina,-38.416097,-63.616672,High Income,Latin America & Caribbean,0.16,0.18,0.04,0.01,0.23
4,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,am,Armenia,40.069099,45.038189,Upper Middle Income,Europe & Central Asia,0.10,0.05,0.03,-0.01,0.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4143,zw,Zimbabwe,-19.015438,29.154857,Low Income,Sub-Saharan Africa,za,South Africa,-30.559482,22.937506,Upper Middle Income,Sub-Saharan Africa,-2.98,-11.79,-9.10,-12.08,-20.76
4144,zw,Zimbabwe,-19.015438,29.154857,Low Income,Sub-Saharan Africa,ae,United Arab Emirates,23.424076,53.847818,High Income,Middle East & North Africa,-2.50,-2.49,-2.21,-1.68,-3.19
4145,zw,Zimbabwe,-19.015438,29.154857,Low Income,Sub-Saharan Africa,gb,United Kingdom,55.378051,-3.435973,High Income,Europe & Central Asia,3.91,4.66,0.74,-0.66,-1.97
4146,zw,Zimbabwe,-19.015438,29.154857,Low Income,Sub-Saharan Africa,us,United States,37.090240,-95.712891,High Income,North America,38.60,37.76,10.09,6.06,5.25


In [75]:
# Describe the dataset
CM_df.describe()

Unnamed: 0,base_lat,base_long,target_lat,target_long,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
count,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0,4148.0
mean,28.418022,21.698305,28.418022,21.698305,0.461757,0.150248,-0.080272,-0.040591,-0.022743
std,25.086012,61.937381,25.086012,61.937381,5.00653,4.201118,3.203092,3.593876,3.633247
min,-40.900557,-106.346771,-40.900557,-106.346771,-37.01,-40.89,-43.66,-56.22,-50.33
25%,14.058324,-3.435973,14.058324,-3.435973,-0.15,-0.19,-0.21,-0.21,-0.21
50%,35.86166,19.145136,35.86166,19.145136,0.0,0.0,0.0,0.0,0.0
75%,47.516231,53.688046,47.516231,53.688046,0.24,0.22,0.16,0.17,0.18
max,64.963051,179.414413,64.963051,179.414413,150.68,124.48,87.0,91.41,87.71


In [93]:
# Show summary information
CM_df.count()

base_country_code           4148
base_country_name           4148
base_lat                    4148
base_long                   4148
base_country_wb_income      4148
base_country_wb_region      4148
target_country_code         4148
target_country_name         4148
target_lat                  4148
target_long                 4148
target_country_wb_income    4148
target_country_wb_region    4148
net_per_10K_2015            4148
net_per_10K_2016            4148
net_per_10K_2017            4148
net_per_10K_2018            4148
net_per_10K_2019            4148
dtype: int64

In [78]:
# Display the mean net per 10K migration in each of the years 2015 to 2019
year_columns = ["net_per_10K_2015", "net_per_10K_2016", "net_per_10K_2017", "net_per_10K_2018", "net_per_10K_2019"]
CM_df[year_columns].mean()

net_per_10K_2015    0.461757
net_per_10K_2016    0.150248
net_per_10K_2017   -0.080272
net_per_10K_2018   -0.040591
net_per_10K_2019   -0.022743
dtype: float64

In [84]:
# Display the mean, max and min migration for the year 2019 for each of the regions (base_country_wb_region)
grouped_CM_2019 = CM_df[["net_per_10K_2019", "base_country_wb_region"]].groupby("base_country_wb_region")
grouped_CM_2019.agg({
    "net_per_10K_2019": ["mean", "max", "min"]
})

                           net_per_10K_2019
                                       mean    max    min
base_country_wb_region
East Asia & Pacific                0.198827  21.57  -9.88
Europe & Central Asia              0.208974  87.71 -21.34
Latin America & Caribbean         -0.904602  21.15 -31.75
Middle East & North Africa        -0.107655  55.60 -50.33
North America                      0.239246  23.20  -0.29
South Asia                        -0.514577  13.72 -24.89
Sub-Saharan Africa                -0.279729  37.11 -21.54
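
Passing a dictionary of lists to a grouped `agg()` produces two header rows (the column name, then the statistic). If a single flat header is preferred, one option is to select the column first and use named aggregation. A minimal sketch on made-up data (the `toy` dataframe, its `region` and `net_2019` columns, and the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical migration figures for two regions
toy = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "net_2019": [1.0, 3.0, -2.0, 4.0],
})

# Selecting one column, then naming each statistic, gives flat column labels
summary = toy.groupby("region")["net_2019"].agg(mean="mean", max="max", min="min")
print(summary)
```

Each keyword becomes a column name in the result, and its value is the statistic to compute.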


In [88]:
# Display the median net migration for the years 2015 and 2019 for the base countries by income level
CM_df[["net_per_10K_2015", "net_per_10K_2019", "base_country_wb_income"]].groupby("base_country_wb_income").median()

                        net_per_10K_2015  net_per_10K_2019
base_country_wb_income
High Income                         0.02              0.04
Low Income                          0.42             -0.05
Lower Middle Income                -0.02             -0.07
Upper Middle Income                -0.03             -0.08


In [89]:
# Display the number of target countries in each income level
CM_df["base_country_wb_income"].value_counts()

High Income            2415
Upper Middle Income     895
Lower Middle Income     653
Low Income              185
Name: base_country_wb_income, dtype: int64
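
`value_counts()` orders its result by frequency, while the expected test output is ordered by label, as `groupby(...).count()` would produce. The numbers are the same. A minimal sketch of the relationship on made-up data, which also shows how counting rows differs from counting distinct values (the `toy` dataframe and its column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical rows: an income level and a target country code
toy = pd.DataFrame({
    "income": ["High", "Low", "High", "Middle", "High", "Middle"],
    "target": ["A", "B", "A", "A", "B", "C"],
})

# value_counts() sorts by count, largest first
by_frequency = toy["income"].value_counts()

# sort_index() reorders the same counts alphabetically by label
by_label = by_frequency.sort_index()

# groupby + count gives the same numbers, already ordered by label
grouped = toy.groupby("income")["target"].count()

# nunique() counts distinct targets, which can be smaller than the row count
distinct_targets = toy.groupby("income")["target"].nunique()

print(by_label.tolist(), grouped.tolist(), distinct_targets.tolist())
```

Whether row counts or distinct counts better answer "number of target countries" depends on how the question is read; the expected test output corresponds to row counts.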

In [92]:
# Display the mean net migration for all five years, for each income level
year_columns = ["net_per_10K_2015", "net_per_10K_2016", "net_per_10K_2017", "net_per_10K_2018", "net_per_10K_2019"]
CM_df.groupby("base_country_wb_income")[year_columns].mean()

                        net_per_10K_2015  net_per_10K_2016  net_per_10K_2017  net_per_10K_2018  net_per_10K_2019
base_country_wb_income
High Income                     0.505482          0.391379          0.314178          0.379201          0.401470
Low Income                      1.876432          0.798270         -0.684865         -0.677784         -0.681459
Lower Middle Income             0.591654         -0.029893         -0.519433         -0.527136         -0.476616
Upper Middle Income            -0.043419         -0.502916         -0.699240         -0.686626         -0.700101


### Exercise 10 - calculating range over a grouped series
---

Open the dataset at this URL: https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-talent-migration.xlsx?raw=true  
Open the sheet named *Skill Migration*.

1.  Display the max for each skill group category of net migration for the year 2017
2.  Assign the max for each skill group category of net migration for the year 2017 to a variable called **max_skill_migration** and print `max_skill_migration`
3.  Create a second variable called **min_skill_migration** and assign to it the min for each skill group category of net migration for the year 2017, print `min_skill_migration`

4.  You now have two pandas Series, `max_skill_migration` and `min_skill_migration`, each backed by a NumPy array.  You can perform calculations on these two Series in the same way as you would on individual data items.

So, you can calculate the range for each skill category by subtracting `min_skill_migration` from `max_skill_migration` to get a new Series, **skill_migration_range**:

`skill_migration_range = max_skill_migration - min_skill_migration`

Try it out.

5.  Now calculate the range for the year 2019
6.  Now calculate the range for countries grouped by base country income level for the year 2015

**Test output**:  
1 and 2  
```
skill_group_category
Business Skills                1048.20
Disruptive Tech Skills         1478.56
Soft Skills                    1572.35
Specialized Industry Skills    1906.14
Tech Skills                    1336.78
Name: net_per_10K_2017, dtype: float64
```

3    
```
skill_group_category
Business Skills               -3471.35
Disruptive Tech Skills        -2646.19
Soft Skills                   -2542.23
Specialized Industry Skills   -6604.67
Tech Skills                   -6060.98
Name: net_per_10K_2017, dtype: float64
```

4  
```
skill_group_category
Business Skills                4519.55
Disruptive Tech Skills         4124.75
Soft Skills                    4114.58
Specialized Industry Skills    8510.81
Tech Skills                    7397.76
Name: net_per_10K_2017, dtype: float64
```

5  
```
skill_group_category
Business Skills                4543.96
Disruptive Tech Skills         3651.81
Soft Skills                    5528.47
Specialized Industry Skills    4036.44
Tech Skills                    3424.45
Name: net_per_10K_2019, dtype: float64
```

6  
```
wb_income
High income            4246.50
Low income             4556.42
Lower middle income    2148.36
Upper middle income    4045.43
Name: net_per_10K_2015, dtype: float64
```





In [94]:
# Open the sheet named Skill Migration
SM_df = pd.read_excel(CM_url, sheet_name = "Skill Migration")
SM_df

Unnamed: 0,country_code,country_name,wb_income,wb_region,skill_group_id,skill_group_category,skill_group_name,net_per_10K_2015,net_per_10K_2016,net_per_10K_2017,net_per_10K_2018,net_per_10K_2019
0,af,Afghanistan,Low income,South Asia,2549,Tech Skills,Information Management,-791.59,-705.88,-550.04,-680.92,-1208.79
1,af,Afghanistan,Low income,South Asia,2608,Business Skills,Operational Efficiency,-1610.25,-933.55,-776.06,-532.22,-790.09
2,af,Afghanistan,Low income,South Asia,3806,Specialized Industry Skills,National Security,-1731.45,-769.68,-756.59,-600.44,-767.64
3,af,Afghanistan,Low income,South Asia,50321,Tech Skills,Software Testing,-957.50,-828.54,-964.73,-406.50,-739.51
4,af,Afghanistan,Low income,South Asia,1606,Specialized Industry Skills,Navy,-1510.71,-841.17,-842.32,-581.71,-718.64
...,...,...,...,...,...,...,...,...,...,...,...,...
17612,zw,Zimbabwe,Low income,Sub-Saharan Africa,12666,Specialized Industry Skills,Teaching,71.18,30.68,-18.85,-68.89,-93.70
17613,zw,Zimbabwe,Low income,Sub-Saharan Africa,1235,Specialized Industry Skills,Mining,8.97,-112.85,-35.87,-65.38,-93.46
17614,zw,Zimbabwe,Low income,Sub-Saharan Africa,43756,Specialized Industry Skills,Personal Coaching,-53.45,-59.70,-88.01,-55.90,-82.23
17615,zw,Zimbabwe,Low income,Sub-Saharan Africa,1724,Specialized Industry Skills,Public Health,15.25,-65.53,-57.22,-39.39,-32.14


In [96]:
# Display the max for each skill group category of net migration for the year 2017
# Assign the max for each skill group category of net migration for the year 2017 to a variable called max_skill_migration and print max_skill_migration

max_skill_migration = SM_df[["net_per_10K_2017", "skill_group_category"]].groupby("skill_group_category").max()
max_skill_migration

                             net_per_10K_2017
skill_group_category
Business Skills                       1048.20
Disruptive Tech Skills                1478.56
Soft Skills                           1572.35
Specialized Industry Skills           1906.14
Tech Skills                           1336.78


In [97]:
# Create a second variable called min_skill_migration and assign to it the min for each skill group category of net migration for the year 2017, print min_skill_migration
min_skill_migration = SM_df[["net_per_10K_2017", "skill_group_category"]].groupby("skill_group_category").min()
min_skill_migration

                             net_per_10K_2017
skill_group_category
Business Skills                      -3471.35
Disruptive Tech Skills               -2646.19
Soft Skills                          -2542.23
Specialized Industry Skills          -6604.67
Tech Skills                          -6060.98


In [98]:
# max_skill_migration and min_skill_migration are pandas objects that share the same skill_group_category index, so you can perform calculations on them element-wise, just as you would on individual values.
# Subtracting min_skill_migration from max_skill_migration gives the range for each skill category, stored as skill_migration_range
skill_migration_range = max_skill_migration - min_skill_migration
skill_migration_range

                             net_per_10K_2017
skill_group_category
Business Skills                       4519.55
Disruptive Tech Skills                4124.75
Soft Skills                           4114.58
Specialized Industry Skills           8510.81
Tech Skills                           7397.76
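
The expected test output for this exercise is a Series (it ends with a `Name:` line), whereas selecting columns with double brackets before grouping gives a one-column dataframe. The values are the same either way. A minimal sketch of the Series form, and of computing the range in one step, on made-up data (the `toy` dataframe, its column names, and the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical net migration figures for two skill categories
toy = pd.DataFrame({
    "category": ["Tech", "Tech", "Soft", "Soft", "Soft"],
    "net_2017": [10.0, -5.0, 2.0, 8.0, -1.0],
})

# Selecting one column with single brackets after groupby yields Series results
grouped = toy.groupby("category")["net_2017"]
max_by_cat = grouped.max()
min_by_cat = grouped.min()

# Subtraction aligns on the shared category index, element by element
range_by_cat = max_by_cat - min_by_cat
print(range_by_cat)

# The same range computed in one step with a custom aggregation
range_in_one_step = grouped.agg(lambda s: s.max() - s.min())
print(range_in_one_step)
```

Subtracting two separately computed results works because both share the same index; the one-step version simply avoids the intermediate variables.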


In [101]:
# Now calculate the range for the year 2019
skill_migration_2019 = SM_df[["net_per_10K_2019", "skill_group_category"]].groupby("skill_group_category")
skill_migration_range_2019 = skill_migration_2019.max() - skill_migration_2019.min()
skill_migration_range_2019

                             net_per_10K_2019
skill_group_category
Business Skills                       4543.96
Disruptive Tech Skills                3651.81
Soft Skills                           5528.47
Specialized Industry Skills           4036.44
Tech Skills                           3424.45


In [103]:
# Now calculate the range for countries grouped by base country income level for the year 2015
income_skill_migration_2015 = SM_df[["net_per_10K_2015", "wb_income"]].groupby("wb_income")
income_skill_migration_range_2015 = income_skill_migration_2015.max() - income_skill_migration_2015.min()
income_skill_migration_range_2015

                     net_per_10K_2015
wb_income
High income                   4246.50
Low income                    4556.42
Lower middle income           2148.36
Upper middle income           4045.43
