<a href="https://colab.research.google.com/github/lilaceri/Working-with-data-/blob/main/Describing_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Describing data with summary statistics
---

This worksheet follows the pandas Getting Started tutorials, picking out particular tutorials and linking them into a single theme.

We will focus on describing data.  Description carries the least risk of bias and inaccurate conclusions, because it reports only on the data that is presented to us.

Each exercise will ask you to work through one tutorial on the Getting Started page, to try the code from the tutorial here and then to try a second, similar action.

---

The practice data from the tutorials comes from a dataset on Titanic passengers.


### Exercise 1 - open the Titanic dataset
---

The Titanic dataset is stored at this URL:
https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv

Read the dataset into a pandas DataFrame that you will call **titanic**.

**Test output**:  
The shape of the dataframe will be (891, 12)

```python
import pandas as pd

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
titanic = pd.read_csv(url)
titanic.shape
```

```
(891, 12)
```

### Exercise 2 - get summary information about the dataframe
---

Read through the tutorials:  
[What kind of data does pandas handle?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#)  
[How do I read and write tabular data?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html)

Use pandas functions to display the following:
1.  A technical summary of the data (info())
2.  A description of the numerical data (describe())
3.  The Series 'Age'

**Test output**:   
1.  The info should show that there are only 204 values in the Cabin series, out of 891 records.  
2.  The description should show 7 columns and a mean age of 29.699118
3.  The Age series should have values of type float64
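
The three calls named above can be tried first on a tiny table. This is a minimal sketch, assuming a small made-up DataFrame called `sample` (its values are hypothetical and are not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the Titanic file
sample = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cy", "Di"],
        "Age": [22.0, 35.0, np.nan, 58.0],
        "Fare": [7.25, 71.28, 8.05, 26.55],
        "Cabin": [None, "C85", None, "C103"],
    }
)

# Technical summary: column names, non-null counts and dtypes
sample.info()

# Descriptive statistics, calculated for the numeric columns only
description = sample.describe()
print(description)

# Selecting a single column by name returns a Series
ages = sample["Age"]
print(ages)
```

Missing values are left out of the non-null counts in `info()` and out of the `count` row in `describe()`, which is why a column can report fewer values than there are records.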

### Exercise 3 - aggregating statistics
---

Read through the tutorial:  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)  

Use pandas functions to display the following summary statistics from the Titanic dataset:  

1.  The average (mean) age of passengers  
2.  The median age and fare  
3.  The mean fare
4.  The modal fare and gender

**Test output**:   
29.699118, Age 28.0000 Fare 14.4542, 32.2042079685746, Fare 8.05 Sex male
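
As a rough illustration of how these statistics behave, here is a minimal sketch on a made-up DataFrame (the `sample` values are hypothetical). It shows that a statistic on one column gives a single number, on several columns gives a Series, and that `mode()` gives a DataFrame:

```python
import pandas as pd

# Hypothetical sample data, not taken from the Titanic file
sample = pd.DataFrame(
    {
        "Age": [22.0, 35.0, 35.0, 58.0],
        "Fare": [7.25, 8.05, 8.05, 26.55],
        "Sex": ["male", "female", "male", "male"],
    }
)

# One column -> one number
mean_age = sample["Age"].mean()

# Several columns -> a Series with one value per column
medians = sample[["Age", "Fare"]].median()

# mode() returns a DataFrame, because a column can have tied modes
modes = sample[["Fare", "Sex"]].mode()

print(mean_age)
print(medians)
print(modes)
```

Note that `mode()` also works on text columns such as `Sex`, whereas `mean()` and `median()` need numeric data.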


### Exercise 4 - displaying other statistics
---

Take a look at the list of methods available for giving summary statistics [here](https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats) 

Use pandas functions, and your existing knowledge, to display the following summary statistics from the Titanic dataset:

1.  The total number of passengers on the Titanic
2.  The age of the youngest passenger
3.  The most expensive ticket price
4.  The range of ticket prices
5.  The number of passengers with cabins
6.  The code for the port where the highest number of passengers embarked
7.  The most populous gender
8.  The standard deviation for age and fare

**Test output**:  
891, 0.42, 512.3292, 512.3292, 204, S, male, Age 14.526497 Fare 49.693429
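
Several of the statistics listed above combine a pandas method with ordinary Python. The following is a minimal sketch, again assuming a made-up `sample` DataFrame with hypothetical values:

```python
import pandas as pd

# Hypothetical sample data, not taken from the Titanic file
sample = pd.DataFrame(
    {
        "Age": [4.0, 22.0, 35.0, 58.0, None],
        "Fare": [0.0, 7.25, 8.05, 26.55, 80.0],
        "Cabin": [None, None, "C85", "C103", "B28"],
        "Embarked": ["S", "C", "S", "Q", "S"],
    }
)

n_rows = len(sample)                      # number of records
youngest = sample["Age"].min()            # missing values are skipped
highest_fare = sample["Fare"].max()
fare_range = highest_fare - sample["Fare"].min()  # range = max - min
with_cabin = sample["Cabin"].count()      # count() ignores missing values

# value_counts() tallies each category; idxmax() gives the most frequent label
busiest_port = sample["Embarked"].value_counts().idxmax()

# Standard deviation for more than one column at once
spread = sample[["Age", "Fare"]].std()

print(n_rows, youngest, highest_fare, fare_range, with_cabin, busiest_port)
print(spread)
```

pandas has no single "range" method, so the range is calculated here as the difference between `max()` and `min()`.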

### Exercise 5 - aggregating statistics grouped by category
---

Refer again to the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)   
looking particularly at the section on Aggregating statistics grouped by category.

1.  What is the mean age for male versus female Titanic passengers?
2.  What is the mean ticket fare price for each of the sex and cabin class combinations?
3.  What is the mean ticket fare price for passengers who embarked at each port?
4.  Which port of embarkation had the highest number of survivors (for now, just show the statistics - it may not be meaningful yet)?

**Test output**:  
1.  female 27.915709 male 30.726645
2.  
```
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
```
3.  
```
C    59.954144
Q    13.276030
S    27.079812
```
4. 
```
Survived  Embarked
0         C            75
          Q            47
          S           427
1         C            93
          Q            30
          S           217
```
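
Grouped statistics follow a split, apply, combine pattern: split the rows by category, apply a statistic to each group, then combine the results. A minimal sketch on a made-up `sample` DataFrame (hypothetical values) might look like this:

```python
import pandas as pd

# Hypothetical sample data, not taken from the Titanic file
sample = pd.DataFrame(
    {
        "Sex": ["female", "male", "female", "male", "male", "female"],
        "Pclass": [1, 1, 3, 3, 3, 1],
        "Fare": [80.0, 50.0, 10.0, 8.0, 12.0, 100.0],
        "Embarked": ["S", "C", "S", "S", "Q", "C"],
        "Survived": [1, 0, 1, 0, 0, 1],
    }
)

# Group by one column, then take the mean of another
fare_by_sex = sample.groupby("Sex")["Fare"].mean()

# Group by two columns: the result has a two-level index
fare_by_sex_class = sample.groupby(["Sex", "Pclass"])["Fare"].mean()

# Count the rows in each combination of two categories
counts = sample.groupby(["Survived", "Embarked"]).size()

print(fare_by_sex)
print(fare_by_sex_class)
print(counts)
```

Selecting the column (`["Fare"]`) before calling `mean()` keeps the calculation to the column of interest and avoids trying to average text columns.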


### Exercise 6 - an aggregation of different statistics
---

Use the function titanic.agg() as shown in the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)  

1.  Display the result of aggregating with:

```
     {
         "Age": ["min", "max", "median", "skew"],
         "Fare": ["min", "max", "median", "mean"],
     }
```
2.  Display:  
min, max and mean for Age  
min, max and standard deviation for Fare  
value_count for Cabin

**Test output**:   
1.  	
```
              Age        Fare
max     80.000000  512.329200
mean          NaN   32.204208
median  28.000000   14.454200
min      0.420000    0.000000
skew     0.389108         NaN
```
```

2.   

![aggregation results](https://drive.google.com/uc?id=11bvvNz7E0d8bhTyhJyHF-E2yFQGxP66N)
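
The `NaN` entries in the output above appear wherever a statistic was not requested for a column. A minimal sketch, assuming a made-up `sample` DataFrame with hypothetical values, shows the same effect on a smaller scale:

```python
import pandas as pd

# Hypothetical sample data, not taken from the Titanic file
sample = pd.DataFrame(
    {
        "Age": [4.0, 22.0, 35.0, 58.0],
        "Fare": [0.0, 7.25, 8.05, 26.55],
    }
)

# Each key is a column; each list holds the statistics wanted for that column
summary = sample.agg(
    {
        "Age": ["min", "max"],
        "Fare": ["max", "mean"],
    }
)

# Rows are the union of all requested statistics.
# "mean" was not requested for Age and "min" was not requested for Fare,
# so those two cells are NaN.
print(summary)
```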


### Exercise 7 - count by category
---

Read the section Count number of records by category in the tutorial  
[How to calculate summary statistics?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#)

1. Display the number of passengers of each gender who had a ticket
2. Display the number of passengers who embarked at each port and had a ticket
3. Calculate the percentage of PassengerIds who survived the sinking of the Titanic (*Hint:  try getting the PassengerIds with a count for survived or not.  Store this value in a new variable, which will contain a list/array.  The second item in this list will be the number who survived.  You can use this number and the count of PassengerIds to calculate the percentage*)

**Test output**:  
1.  female 314, male 577
2.  C 168, Q 77, S 644
3.  38.38383838383838
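
The approach described in the hint can be sketched on a small made-up table. The `sample` DataFrame and its values are hypothetical; only the column names mirror the Titanic dataset:

```python
import pandas as pd

# Hypothetical sample data, not taken from the Titanic file
sample = pd.DataFrame(
    {
        "PassengerId": [1, 2, 3, 4, 5],
        "Sex": ["male", "female", "male", "male", "female"],
        "Survived": [0, 1, 0, 1, 1],
    }
)

# Number of records in each category of one column
by_sex = sample["Sex"].value_counts()

# Count of PassengerIds for each value of Survived (0 = no, 1 = yes)
survival_counts = sample.groupby("Survived")["PassengerId"].count()

# Look up the survivors by label, then turn the count into a percentage
survivors = survival_counts.loc[1]
total = sample["PassengerId"].count()
percent_survived = 100 * survivors / total

print(by_sex)
print(survival_counts)
print(percent_survived)
```

Using `.loc[1]` looks the value up by its label (`Survived == 1`) rather than by its position, which keeps the code correct even if the order of the groups changes.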



### Exercise 8 - summary happiness statistics
---

Open the data set here: https://github.com/futureCodersSE/working-with-data/blob/main/Happiness-Data/2019.xlsx?raw=true

It contains data on people's perception of happiness levels in a number of countries across the world.

1.  Display the number of records in the set  
2.  Display the description of the numerical data  
3.  Display the mean GDP and life expectancy  
4.  Display the mean, max and min for Freedom; the mean, max, min and skew for Generosity; and the mean, min, max and std for GDP  

**Test output**:  
1.  156
2.  Table showing count, mean, std, min, 25%, 50%, 75%, max for 8 columns
3.  GDP 0.905147, life expectancy 0.725244  
4.  


```
      Freedom to make life choices  Generosity  GDP per capita
max                       0.631000    0.566000        1.684000
mean                      0.392571    0.184846        0.905147
min                       0.000000    0.000000        0.000000
skew                           NaN    0.745942             NaN
std                            NaN         NaN        0.398389
```




### Exercise 9 - migration data
---

Open the dataset at this URL: https://github.com/futureCodersSE/working-with-data/blob/main/Data%20sets/public_use-industry-skills-needs.xlsx?raw=true

Read the sheet named *Country Migration*.

1.  Describe the dataset  
2.  Show summary statistics information
3.  Display the mean net per 10K migration in each of the years 2015 to 2019
4.  Display the mean, max and min migration for the year 2019 for each of the regions (*base_country_wb_region*)
5.  Display the median net migration for the years 2015 and 2019 for the base countries by income level
6.  Display the number of target countries in each income level
7.  Display the mean net migration for 
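
Reading one named sheet and combining `groupby()` with `agg()` can be sketched as follows. This is only an illustration under stated assumptions: `load_sheet` is a hypothetical helper, the `sample` values are made up, and the column names `income_level`, `net_2015` and `net_2019` are placeholders rather than the real column names in the workbook (only `base_country_wb_region` is taken from the exercise). Reading `.xlsx` files with pandas also normally requires the `openpyxl` package to be installed.

```python
import pandas as pd


def load_sheet(source, sheet):
    """Hypothetical helper: read one named sheet from an Excel file or URL."""
    # sheet_name selects the sheet; without it pandas reads the first sheet
    return pd.read_excel(source, sheet_name=sheet)


# Hypothetical stand-in for the migration sheet (placeholder column names)
sample = pd.DataFrame(
    {
        "base_country_wb_region": ["Europe", "Europe", "Asia", "Asia"],
        "income_level": ["High", "Low", "High", "High"],
        "net_2015": [1.0, -2.0, 0.5, 3.5],
        "net_2019": [2.0, -1.0, 1.0, 4.0],
    }
)

# Mean of several year columns at once
yearly_means = sample[["net_2015", "net_2019"]].mean()

# Several statistics for one column, calculated per region
by_region = sample.groupby("base_country_wb_region")["net_2019"].agg(
    ["mean", "max", "min"]
)

# One statistic for two columns, calculated per income level
by_income = sample.groupby("income_level")[["net_2015", "net_2019"]].median()

print(yearly_means)
print(by_region)
print(by_income)
```

With the real workbook, a call such as `load_sheet(url, "Country Migration")` would stand in for the made-up `sample`, and the placeholder column names would need to be replaced with the ones shown by `info()`.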

