# Lab practice: Essential functionality in Pandas

In this lab, you will use pandas DataFrame objects to work with tabular data and write queries to retrieve data from a DataFrame.

Learning objectives:

1. Create dataframes from generated data
2. Work with tabular data

In [1]:
import pandas as pd
import numpy as np

## Part I: Creating dataframes from generated data

### Problem 1: Creating a matrix from a sequence

Create a matrix of size 10x10 (10 rows and 10 columns) filled with a sequence of integers from 0 to 99.

Tips:
- Use the *np.arange()* function to generate 100 integers, and then use the *reshape()* function to turn them into the 10x10 matrix
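The tip above can be sketched on a smaller case. This is a minimal illustration, assuming a 3x4 shape chosen only for the example; the names `sequence` and `small_matrix` are hypothetical.

```python
import numpy as np

# Generate the integers 0..11 as a one-dimensional array.
sequence = np.arange(12)

# Rearrange the same 12 values into 3 rows and 4 columns.
# The product of the new dimensions must equal the number of elements.
small_matrix = sequence.reshape(3, 4)
print(small_matrix)
```

The same two calls should scale to the 10x10 case by changing the number of generated integers and the target shape.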

In [18]:
# YOUR SOLUTION

In [19]:
# Example of expected output

print("""
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])
""")


array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])



### Problem 2: Creating a dataframe from a sequence

Create a dataframe of 10 rows and 10 columns filled with a sequence of integers from 0 to 99. Name the columns with the prefix 'col_' followed by an integer starting from 1 indicating the position of the column.

Tips:
- Use your solution from Problem 1 to generate a 10x10 matrix
- Construct a list of strings, each made of the prefix 'col_' followed by an integer starting from 1. You can use a for loop or a list comprehension. Then, pass this list of strings as the *columns* parameter when creating the DataFrame
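As a rough sketch of the second tip, the example below builds column labels with a list comprehension and passes them to the DataFrame constructor. The prefix 'c_', the 2x3 shape, and the names `labels` and `small_df` are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

# Build the labels 'c_1', 'c_2', 'c_3' (hypothetical prefix, not the lab's).
labels = [f"c_{i}" for i in range(1, 4)]

# Wrap a 2x3 matrix in a DataFrame and name its columns with the list.
small_df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=labels)
print(small_df)
```

The list must contain exactly one label per column of the matrix; otherwise the constructor raises an error.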

In [20]:
# YOUR SOLUTION

In [21]:
# Example of expected output

print("""
	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9	col_10
0	0	1	2	3	4	5	6	7	8	9
...
9	90	91	92	93	94	95	96	97	98	99
""")


	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9	col_10
0	0	1	2	3	4	5	6	7	8	9
...
9	90	91	92	93	94	95	96	97	98	99



### Problem 3: Dropping a column

Create a new DataFrame by dropping the column *col_10* from the DataFrame created in the previous problem.
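One possible way to drop a column is the *drop* method with the *columns* parameter. The sketch below uses a hypothetical three-column DataFrame named `toy`, not the one from the previous problem.

```python
import pandas as pd

# Hypothetical toy data with three columns.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop returns a new DataFrame by default; toy itself is left unchanged.
trimmed = toy.drop(columns=["c"])
print(trimmed)
```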

In [22]:
# YOUR SOLUTION

### Problems 4-5: Filtering  

Retrieve all rows in the column 'col_1'.

In [32]:
# YOUR SOLUTION

Retrieve all rows and columns where the column 'col_1' value is greater than 50.
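Both kinds of retrieval can be illustrated on toy data. This is a minimal sketch, assuming a hypothetical DataFrame `toy` with columns 'a' and 'b': square brackets with a column name select one column, and square brackets with a boolean mask keep only the rows where the mask is True.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"a": [10, 60, 70], "b": [1, 2, 3]})

# Selecting a single column returns a Series with every row of that column.
col_a = toy["a"]

# A comparison on a column yields a boolean Series (one True/False per row).
mask = toy["a"] > 50

# Indexing with the mask keeps all columns, but only the rows marked True.
filtered = toy[mask]
print(filtered)
```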

In [33]:
# Example of expected output

print("""
	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9	col_10
6	60	61	62	63	64	65	66	67	68	69
7	70	71	72	73	74	75	76	77	78	79
8	80	81	82	83	84	85	86	87	88	89
9	90	91	92	93	94	95	96	97	98	99
...
""")


	col_1	col_2	col_3	col_4	col_5	col_6	col_7	col_8	col_9	col_10
6	60	61	62	63	64	65	66	67	68	69
7	70	71	72	73	74	75	76	77	78	79
8	80	81	82	83	84	85	86	87	88	89
9	90	91	92	93	94	95	96	97	98	99
...



In [34]:
# YOUR SOLUTION

## Part II: Working with tabular data

For the following problems, you will be working with a dataset from NYC about causes of death from 2007 to 2014. The dataset is in CSV format. We are interested in the following fields:
 - Year
 - Ethnicity
 - Sex
 - Cause of Death
 
The end goal is to determine the leading causes of death for males and females, according to the data.

Load the dataset:

In [35]:
df_original = pd.read_csv('data/nyc_deaths.csv')
df_original.head()

| | Year | Cause of Death | Sex | Ethnicity | Count | Death Rate | Age Adjusted Death Rate |
|---|---|---|---|---|---|---|---|
| 0 | 2010 | Influenza (Flu) and Pneumonia (J09-J18) | F | Hispanic | 228 | 18.7 | 23.1 |
| 1 | 2008 | Accidents Except Drug Posioning (V01-X39, X43,... | F | Hispanic | 68 | 5.8 | 6.6 |
| 2 | 2013 | Accidents Except Drug Posioning (V01-X39, X43,... | M | White Non-Hispanic | 271 | 20.1 | 17.9 |
| 3 | 2010 | Cerebrovascular Disease (Stroke: I60-I69) | M | Hispanic | 140 | 12.3 | 21.4 |
| 4 | 2009 | Assault (Homicide: Y87.1, X85-Y09) | M | Black Non-Hispanic | 255 | 30.0 | 30.0 |


Number of rows and attributes:

In [36]:
df_original.shape

(1094, 7)

In the following problems, you will modify the DataFrame created from the CSV source. Therefore, we create a copy and leave the original DataFrame intact.

From now on, use the *df* DataFrame.

In [44]:
df = df_original.copy()

### Problem 6: Data cleaning (part I)

Before using the data for any analysis, you first need to perform some cleaning operations.

Apply what you know about manipulating data in a DataFrame to drop the 'Death Rate' and 'Age Adjusted Death Rate' columns, as we are not going to use them.

Tips:
- Use the [inplace](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) parameter of the drop method to modify the DataFrame, instead of creating a new DataFrame
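The effect of *inplace* can be seen in a small sketch. The DataFrame `toy` and its column names below are hypothetical; the point is only that with `inplace=True` the method modifies the existing object and returns `None`.

```python
import pandas as pd

# Hypothetical toy data with two columns we do not need.
toy = pd.DataFrame({"keep": [1, 2], "unused_1": [0.1, 0.2], "unused_2": [0.3, 0.4]})

# With inplace=True, drop changes toy directly and returns None,
# so the result should not be assigned back to toy.
result = toy.drop(columns=["unused_1", "unused_2"], inplace=True)

print(result)          # None
print(toy.columns)     # only 'keep' remains
```

Assigning the return value back (for example `toy = toy.drop(..., inplace=True)`) would replace the DataFrame with `None`, which is a common mistake.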

In [37]:
# Example of expected output

print("""
	Year	Cause of Death	Sex	Ethnicity	Count
0	2010	Influenza ...	F	Hispanic	228
1	2008	Accidents ...	F	Hispanic	68
2	2013	Accidents ...	M	White...	271
3	2010	Cerebrov...	M	Hispanic	140
4	2009	Assault ...	M	Black ...	255
...
""")


	Year	Cause of Death	Sex	Ethnicity	Count
0	2010	Influenza ...	F	Hispanic	228
1	2008	Accidents ...	F	Hispanic	68
2	2013	Accidents ...	M	White...	271
3	2010	Cerebrov...	M	Hispanic	140
4	2009	Assault ...	M	Black ...	255
...



In [21]:
# YOUR SOLUTION

### Problem 7: Data cleaning (part II)

Note that the 'Count' column of the DataFrame contains some non-numeric values, such as dots (.):

In [45]:
df[df['Count'] == "."]

| | Year | Cause of Death | Sex | Ethnicity | Count | Death Rate | Age Adjusted Death Rate |
|---|---|---|---|---|---|---|---|
| 5 | 2012 | Mental and Behavioral Disorders due to Acciden... | F | Other Race/ Ethnicity | . | . | . |
| 9 | 2009 | Alzheimer's Disease (G30) | F | Other Race/ Ethnicity | . | . | . |
| 15 | 2011 | Accidents Except Drug Posioning (V01-X39, X43,... | F | Other Race/ Ethnicity | . | . | . |
| 18 | 2008 | Chronic Liver Disease and Cirrhosis (K70, K73) | F | Not Stated/Unknown | . | . | . |
| 19 | 2007 | Alzheimer's Disease (G30) | F | Not Stated/Unknown | . | . | . |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1071 | 2014 | Chronic Liver Disease and Cirrhosis (K70, K73) | F | Not Stated/Unknown | . | . | . |
| 1077 | 2008 | Accidents Except Drug Posioning (V01-X39, X43,... | F | Other Race/ Ethnicity | . | . | . |
| 1078 | 2009 | Diabetes Mellitus (E10-E14) | M | Other Race/ Ethnicity | . | . | . |
| 1085 | 2010 | Atherosclerosis (I70) | F | Not Stated/Unknown | . | . | . |


Most likely, these values were not available when the data was collected and were replaced with a default placeholder.

Remove invalid values in the Count column, preserving only those records with a valid number.

Tips:
- You can use the *drop* method to remove those records. This time, use *drop* with axis=0, which is the default axis. To make it work, you need to specify the index labels of the rows you want to remove. Use the *index* property of the DataFrame that results from filtering the rows with a '.' in the 'Count' column.
- Remember to use the inplace=True parameter when calling the *drop* method.
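The row-dropping pattern described in the tips can be sketched on toy data. The DataFrame `toy`, its columns, and the '-' placeholder are assumptions made for this illustration only.

```python
import pandas as pd

# Hypothetical toy data where '-' marks a missing amount.
toy = pd.DataFrame({"item": ["x", "y", "z"], "amount": ["4", "-", "7"]})

# Filter the offending rows, then keep only their index labels.
bad_rows = toy[toy["amount"] == "-"].index

# Passing index labels to drop removes those rows (axis=0 is the default).
toy.drop(bad_rows, inplace=True)
print(toy)
```

Note that the remaining rows keep their original index labels, so the index is no longer consecutive after the drop.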

In [51]:
df[df['Count'] == "."].index

Int64Index([   5,    9,   15,   18,   19,   34,   39,   43,   45,   46,
            ...
            1027, 1038, 1044, 1047, 1067, 1071, 1077, 1078, 1085, 1088],
           dtype='int64', length=138)

In [52]:
# YOUR SOLUTION

### Problem 8: Data cleaning (part III)

Now the 'Count' column contains only numeric values, so it should have a numeric dtype. However, its dtype is still 'object', which pandas uses to represent strings and other mixed types of values.

In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1094 entries, 0 to 1093
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Year                     1094 non-null   int64 
 1   Cause of Death           1094 non-null   object
 2   Sex                      1094 non-null   object
 3   Ethnicity                1094 non-null   object
 4   Count                    1094 non-null   object
 5   Death Rate               1094 non-null   object
 6   Age Adjusted Death Rate  1094 non-null   object
dtypes: int64(1), object(6)
memory usage: 60.0+ KB


In [54]:
df['Count'].dtype

dtype('O')

Convert the 'Count' column to a numeric dtype so that we can perform arithmetic operations on it.

Use the [pandas.to_numeric](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html) function. For example,

`df['Count'] = pd.to_numeric(df['Count'])`
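To see what the conversion changes, here is a minimal sketch on a hypothetical column of numeric strings. It also shows the `errors="coerce"` option of `pd.to_numeric`, which turns unparseable entries into NaN instead of raising an error; the lab does not require that option if the invalid rows were already removed.

```python
import pandas as pd

# Hypothetical toy column holding numbers stored as strings.
toy = pd.DataFrame({"amount": ["4", "7", "10"]})

# After the conversion the column has a numeric dtype, so arithmetic works.
toy["amount"] = pd.to_numeric(toy["amount"])
total = toy["amount"].sum()
print(toy["amount"].dtype, total)

# With errors="coerce", values that cannot be parsed become NaN.
coerced = pd.to_numeric(pd.Series(["4", "."]), errors="coerce")
print(coerced)
```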

In [21]:
# YOUR SOLUTION

Use the *describe()* method to generate descriptive statistics on the numeric values in your DataFrame:

In [None]:
df.describe()

Now that you have cleaned the data, you are ready to use the dataset to answer some questions.

### Problem 9: How many male records and how many female records are there in the data?
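One way to count records per category is the `value_counts` method of a Series. The sketch below uses a hypothetical 'team' column rather than the lab's data.

```python
import pandas as pd

# Hypothetical toy data with a categorical column.
toy = pd.DataFrame({"team": ["red", "blue", "red", "red"]})

# value_counts returns one entry per distinct value, with the number of rows
# that hold it, sorted from most to least frequent.
counts = toy["team"].value_counts()
print(counts)
```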

In [57]:
# YOUR SOLUTION

### Problem 10: What ethnicities are included in the data for females?

Tips:
- Use the [unique](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html) method
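The tip can be combined with a row filter, as in this minimal sketch. The DataFrame `toy` and its 'team' and 'city' columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "team": ["red", "blue", "red", "blue", "blue"],
    "city": ["Oslo", "Lima", "Oslo", "Quito", "Lima"],
})

# Keep only the rows of one category, select a column,
# and list its distinct values in order of first appearance.
blue_cities = toy[toy["team"] == "blue"]["city"].unique()
print(blue_cities)
```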

In [30]:
# YOUR SOLUTION

### Problem 11: What are the three most frequent causes of death? (EXTRA)

To answer this question, aggregate the records by cause of death and list the counts for each cause of death in descending order.

One way to aggregate records is to use the [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method.

The groupby method behaves very similarly to the GROUP BY clause in SQL. It combines records that share the same value in the grouping column, and it is usually followed by an aggregation operation, such as a sum, that dictates how to combine the values within each group:

In [62]:
data_frame = pd.DataFrame(
    [{'player': 'james', 'score':30}, {'player': 'james', 'score':20}, {'player': 'michael', 'score':30}])
data_frame

| | player | score |
|---|---|---|
| 0 | james | 30 |
| 1 | james | 20 |
| 2 | michael | 30 |


We can use *groupby* to aggregate the scores of each player by adding all the scores corresponding to each player:

In [63]:
grouped_data_frame = data_frame.groupby(['player']).sum()
grouped_data_frame

| player | score |
|---|---|
| james | 50 |
| michael | 30 |


> Calling an aggregation such as *sum()* on the result of *groupby* creates a new DataFrame with one row per group, where each column holds the aggregated value for that group.

Then, we can use the *sort_values* method to sort our DataFrame:

In [68]:
grouped_data_frame.sort_values(by="score", ascending = True)

| player | score |
|---|---|
| michael | 30 |
| james | 50 |


In [69]:
grouped_data_frame.sort_values(by="score", ascending = False)

| player | score |
|---|---|
| james | 50 |
| michael | 30 |


In [32]:
# YOUR SOLUTION

In [64]:
# Example of expected output

print("""
			Count
Cause of Death	
Diseases of Heart ...	147551
Malignant Neoplasms ...	106367
All Other Causes	77999
""")


			Count
Cause of Death	
Diseases of Heart ...	147551
Malignant Neoplasms ...	106367
All Other Causes	77999

