# 3. Pandas

Pandas is the best-known Python library for manipulating and analyzing data. It is built on top of NumPy, so many of its features will look familiar. We will use Pandas to work with structured datasets.

Just as NumPy gives us arrays and the operations that come with them, Pandas gives us two main objects: DataFrames and Series. The DataFrame is by far the more widely used of the two.
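To make the relationship between the two objects concrete, here is a minimal sketch with made-up data (the names and amounts are hypothetical, not taken from the real dataset). It shows that selecting one column of a DataFrame gives back a Series, and that a Series can be converted to a NumPy array.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "name": ["Ana", "Luis", "Eva"],
    "amount": [3, 5, 2],
})

# Selecting a single column returns a Series
amounts = df["amount"]

print(type(df).__name__)       # DataFrame
print(type(amounts).__name__)  # Series
print(amounts.to_numpy())      # the values as a NumPy array
```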

We are going to use open data from the Argentine government, so you will need to download the CSV file from the following link: [Names 2010-2014](https://www.datos.gob.ar/dataset/otros-nombres-personas-fisicas)

In [2]:
import pandas as pd

## Reading a csv file

In [3]:
!gdown "1rrunop8AFG7bRdVbQZWXFQAQas581ouV"

df_names = pd.read_csv('nombres-2010-2014.csv')
df_names


Downloading...
From: https://drive.google.com/uc?id=1rrunop8AFG7bRdVbQZWXFQAQas581ouV
To: /content/nombres-2010-2014.csv


Unnamed: 0,nombre,cantidad,anio
0,Benjamin,2986,2010
1,Sofia,2252,2010
2,Bautista,2176,2010
3,Joaquín,2111,2010
4,Juan Ignacio,2039,2010
...,...,...,...
871489,Leire Jasmin,1,2014
871490,Isaias Sebastian Ariel,1,2014
871491,Yanira Valentina,1,2014
871492,Angie Ainara,1,2014
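If the download is not available, `pd.read_csv` can be tried on a small in-memory sample instead. The sketch below uses a hypothetical two-row CSV string that only mimics the column layout of the real file.

```python
import io

import pandas as pd

# Hypothetical CSV text with the same column names as the real file
csv_text = "nombre,cantidad,anio\nBenjamin,2986,2010\nSofia,2252,2010\n"

# read_csv accepts a path or any file-like object
df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small.shape)  # (2, 3)
print(df_small.dtypes)
```

For large files, the `nrows` parameter of `pd.read_csv` limits how many rows are read, which is handy for a quick first look.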


In [8]:
df_names.dtypes

Unnamed: 0,0
name,object
amount,int64
year,int64


## Renaming columns

First of all, let's rename the columns from Spanish to `name`, `amount` and `year`.

In [4]:
df_names.rename(columns={'nombre': 'name', 'cantidad': 'amount', 'anio': 'year'}, inplace=True)
df_names

Unnamed: 0,name,amount,year
0,Benjamin,2986,2010
1,Sofia,2252,2010
2,Bautista,2176,2010
3,Joaquín,2111,2010
4,Juan Ignacio,2039,2010
...,...,...,...
871489,Leire Jasmin,1,2014
871490,Isaias Sebastian Ariel,1,2014
871491,Yanira Valentina,1,2014
871492,Angie Ainara,1,2014
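The cell above uses `inplace=True`. An alternative, sketched here on a hypothetical one-row frame, is to let `rename` return a new DataFrame and keep the original untouched.

```python
import pandas as pd

# Hypothetical one-row frame with the Spanish column names
df_es = pd.DataFrame({"nombre": ["Sofia"], "cantidad": [2252], "anio": [2010]})

# Without inplace=True, rename returns a renamed copy
df_en = df_es.rename(
    columns={"nombre": "name", "cantidad": "amount", "anio": "year"}
)

print(list(df_es.columns))  # ['nombre', 'cantidad', 'anio']
print(list(df_en.columns))  # ['name', 'amount', 'year']
```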


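## Some useful Pandas functions

**TODO:** Investigate the functions used in the next cell. What do they do? What do you think they are useful for?

Answer:
`.shape` returns a tuple with the number of rows and the number of columns. `.head()` and `.tail()` return the first and last rows (five by default), and `.count()` returns the number of non-null values in each column.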


In [19]:
df_names.head()
# df_names.tail()
df_names.count()
# df_names.shape

Unnamed: 0,0
name,871494
amount,871494
year,871494
amount_chars,871494
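The difference between `shape` and `count()` only shows up when values are missing. The following sketch uses a hypothetical four-row frame with one missing amount to illustrate it.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (invented data)
df = pd.DataFrame({
    "name": ["Ana", "Luis", "Eva", "Juan"],
    "amount": [3.0, np.nan, 2.0, 7.0],
})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
print(df.shape)            # (4, 2)

print(df.head(2))          # first two rows
print(df.tail(1))          # last row

# count() ignores missing values, so 'amount' reports 3, not 4
print(df.count())
```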


## Append a new row

**TODO:** Suppose that, when the data was loaded, someone forgot to add a name together with its amount and year.

Let's add the following row to our dataset:

Name: "Daenerys Stormborn of the House Targaryen, First of Her Name, the Unburnt, Queen of the Andals and the First Men, Khaleesi of the Great Grass Sea, Breaker of Chains, and Mother of Dragons"

Amount: 100
Year: 2011

In [7]:
# Complete this cell with your code
new_row = pd.DataFrame([{
    "name": "Daenerys Stormborn of the House Targaryen, First of Her Name, the Unburnt, Queen of the Andals and the First Men, Khaleesi of the Great Grass Sea, Breaker of Chains, and Mother of Dragons",
    "amount": 100,
    "year": 2011,
}])
df_names = pd.concat([df_names, new_row], ignore_index=True)
df_names.tail()

Unnamed: 0,name,amount,year
871490,Isaias Sebastian Ariel,1,2014
871491,Yanira Valentina,1,2014
871492,Angie Ainara,1,2014
871493,Elias Hernando,1,2014
871494,"Daenerys Stormborn of the House Targaryen, Fir...",100,2011


**TODO:** Investigate the `columns` and `index` attributes. What do they contain? What data type do they return? What familiar data type do they resemble?

In [15]:
df_names.columns

Index(['name', 'amount', 'year'], dtype='object')

In [16]:
df_names.index

RangeIndex(start=0, stop=871495, step=1)
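As a small sketch on a hypothetical two-row frame, both attributes return pandas `Index` objects, which behave much like immutable sequences and can be turned into plain Python lists.

```python
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({"name": ["Ana", "Luis"], "amount": [3, 5]})

print(type(df.columns).__name__)  # Index
print(type(df.index).__name__)    # RangeIndex

# Both can be converted to ordinary lists
print(df.columns.tolist())  # ['name', 'amount']
print(list(df.index))       # [0, 1]
```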

## Add a new column

**TODO:** Add a column to the dataframe with the number of characters in each name.

In [37]:
# Complete this cell with your code
df_names['amount_chars'] = df_names['name'].str.len()
df_names.sort_values("amount_chars", ascending=False, inplace=True)
df_names.head()


Unnamed: 0,name,amount,year,amount_chars
871355,Malena Niicole ...,1,2014,100
517996,Daniel Arnaldo ...,2,2013,100
575030,Samira Guadalupe ...,1,2013,100
583719,Luz Mila Milagro ...,1,2013,100
234320,Nahiara Camila Jazmín ...,1,2011,100
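In the output above, several names report 100 characters although the visible text is much shorter. One possible explanation is trailing whitespace in the source data, but this is an assumption that the notebook does not confirm. The sketch below, on hypothetical names, shows how `str.strip()` changes the character count in that situation.

```python
import pandas as pd

# Toy names; the last one is padded with trailing spaces on purpose
df = pd.DataFrame({"name": ["Luz", "Maria Paz", "Ana   "]})

# Raw length counts the padding as characters
df["n_chars"] = df["name"].str.len()

# Stripping whitespace first gives the visible length
df["n_chars_stripped"] = df["name"].str.strip().str.len()

print(df)
```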


## Filtering by mask

Mask filtering works almost identically in NumPy and Pandas, so we will look at it first in NumPy and then in Pandas.

Suppose we roll a die 100 times and want to keep only the rolls that were greater than three. How can we do it?

In [22]:
import numpy as np
dice = np.random.randint(1, 7, size=100)
print(dice)

[1 4 2 4 1 5 4 1 2 2 5 6 4 2 3 5 1 1 6 6 4 2 4 1 1 3 6 6 5 4 5 3 2 1 5 1 2
 3 2 6 6 6 6 5 5 3 3 4 1 5 3 6 3 1 2 1 4 5 4 2 6 2 3 3 6 5 2 4 2 6 6 5 6 1
 3 2 1 2 1 5 1 1 4 4 4 1 3 3 3 6 6 6 5 5 5 3 3 6 2 5]


We can create a boolean mask:

In [23]:
mask = dice > 3
print(mask)
print(type(mask))

[False  True False  True False  True  True False False False  True  True
  True False False  True False False  True  True  True False  True False
 False False  True  True  True  True  True False False False  True False
 False False False  True  True  True  True  True  True False False  True
 False  True False  True False False False False  True  True  True False
  True False False False  True  True False  True False  True  True  True
  True False False False False False False  True False False  True  True
  True False False False False  True  True  True  True  True  True False
 False  True False  True]
<class 'numpy.ndarray'>


In [24]:
print(dice[mask])

[4 4 5 4 5 6 4 5 6 6 4 4 6 6 5 4 5 5 6 6 6 6 5 5 4 5 6 4 5 4 6 6 5 4 6 6 5
 6 5 4 4 4 6 6 6 5 5 5 6 5]


In [25]:
print(dice.sum())

353


In [26]:
print(dice[dice > 3])

[4 4 5 4 5 6 4 5 6 6 4 4 6 6 5 4 5 5 6 6 6 6 5 5 4 5 6 4 5 4 6 6 5 4 6 6 5
 6 5 4 4 4 6 6 6 5 5 5 6 5]


**TODO:** Going back to our dataset, suppose we want to keep only the rows whose name was registered more than 2000 times in the corresponding year. Note that a name may appear in the result more than once, for different years.

In [40]:
# Complete this cell with your code
mask = df_names['amount'] > 2000
df_names[mask]



Unnamed: 0,name,amount,year,amount_chars
457931,Thiago Benjamin,2107,2013,15
4,Juan Ignacio,2039,2010,12
662401,Juan Ignacio,2260,2014,12
457930,Juan Ignacio,2198,2013,12
254439,Juan Ignacio,2632,2012,12
457927,Valentina,2450,2013,10
662403,Francisco,2140,2014,9
662393,Valentino,2702,2014,9
662398,Valentina,2432,2014,9
254440,Valentina,2564,2012,9


**TODO:** What if we want to select those names with more than 8 characters and from 2010 onwards?

In [42]:
# Complete this cell with your code
mask = (df_names['amount_chars'] > 8) & (df_names['year'] >= 2010)
df_names[mask]


Unnamed: 0,name,amount,year,amount_chars
871355,Malena Niicole ...,1,2014,100
517996,Daniel Arnaldo ...,2,2013,100
575030,Samira Guadalupe ...,1,2013,100
583719,Luz Mila Milagro ...,1,2013,100
234320,Nahiara Camila Jazmín ...,1,2011,100
...,...,...,...,...
871369,Wanda Mia,1,2014,9
123,Maria Pia,291,2010,9
104,Constanza,337,2010,9
114,Maria Paz,318,2010,9
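When combining conditions, each comparison needs its own parentheses, and the element-wise operators `&` (and), `|` (or) and `~` (not) replace Python's `and`, `or` and `not`. Here is a minimal sketch on a hypothetical four-row frame; the `query` call at the end is an alternative way to write the same kind of filter.

```python
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({
    "name": ["Juan Ignacio", "Luz", "Valentina", "Constanza"],
    "amount": [2039, 328, 2450, 337],
    "year": [2010, 2014, 2013, 2010],
})

long_name = df["name"].str.len() > 8
recent = df["year"] >= 2013

both = df[long_name & recent]    # rows meeting both conditions
either = df[long_name | recent]  # rows meeting at least one

# Same style of filter written as a query string
popular_recent = df.query("year >= 2013 and amount > 2000")

print(both["name"].tolist())            # ['Valentina']
print(len(either))                      # 4
print(popular_recent["name"].tolist())  # ['Valentina']
```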


## Statistics

**TODO:** Obtain the mean value and standard deviation of each numeric column. Is there a function in Pandas that will give us even more statistics?

In [45]:
# Complete this cell with your code
df_names.describe()


Unnamed: 0,amount,year,amount_chars
count,871494.0,871494.0,871494.0
mean,4.278225,2012.267647,14.356866
std,34.615712,1.370919,3.672557
min,1.0,2010.0,2.0
25%,1.0,2011.0,12.0
50%,1.0,2012.0,14.0
75%,2.0,2013.0,16.0
max,4960.0,2014.0,100.0


In [46]:
df_names.sort_values('amount',ascending=False).head()

Unnamed: 0,name,amount,year,amount_chars
457917,Benjamin,4960,2013,8
254431,Benjamin,4724,2012,8
662387,Benjamin,4286,2014,8
457918,Isabella,3587,2013,8
662388,Martina,3563,2014,7
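`describe()` gives the full summary, but the mean and standard deviation can also be requested on their own. This sketch uses a hypothetical four-row frame; `numeric_only=True` skips the text column.

```python
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({
    "name": ["Ana", "Luis", "Eva", "Juan"],
    "amount": [1, 2, 3, 4],
    "year": [2010, 2011, 2012, 2013],
})

means = df.mean(numeric_only=True)
stds = df.std(numeric_only=True)  # sample standard deviation (ddof=1)

# agg computes several statistics at once
summary = df[["amount", "year"]].agg(["mean", "std"])

print(means["amount"])  # 2.5
print(summary)
```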


## Delete a column

**TODO:** Delete the column `amount_chars` from the dataframe.

In [47]:
# Complete this cell with your code
# Note: drop() returns a new DataFrame; df_names keeps the column unless the result is reassigned
df_names.drop(columns=['amount_chars'])

Unnamed: 0,name,amount,year
871355,Malena Niicole ...,1,2014
517996,Daniel Arnaldo ...,2,2013
575030,Samira Guadalupe ...,1,2013
583719,Luz Mila Milagro ...,1,2013
234320,Nahiara Camila Jazmín ...,1,2011
...,...,...,...
111922,Jad,1,2010
36454,Ani,2,2010
662562,Pía,325,2014
662561,Luz,328,2014
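By default `drop` returns a new DataFrame and leaves the original unchanged, which is consistent with `amount_chars` still appearing in later outputs of this notebook. A minimal sketch on a hypothetical one-row frame shows the difference between discarding and reassigning the result.

```python
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({"name": ["Luz"], "amount": [328], "n_chars": [3]})

# The result is discarded here, so df still has the column
df.drop(columns=["n_chars"])
cols_before = list(df.columns)
print(cols_before)  # ['name', 'amount', 'n_chars']

# Reassigning makes the change stick
df = df.drop(columns=["n_chars"])
cols_after = list(df.columns)
print(cols_after)   # ['name', 'amount']
```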


## Sorting by column

**TODO:** Sort the dataframe by `amount` in descending order.

In [48]:
# Complete this cell with your code
df_names.sort_values('amount',ascending=False)

Unnamed: 0,name,amount,year,amount_chars
457917,Benjamin,4960,2013,8
254431,Benjamin,4724,2012,8
662387,Benjamin,4286,2014,8
457918,Isabella,3587,2013,8
662388,Martina,3563,2014,7
...,...,...,...,...
547710,Ina,1,2013,3
186243,Ari,1,2011,3
785097,Sam,1,2014,3
803871,Macarena Del Valle (presunto 38607523),1,2014,39
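Many rows share the same `amount`, so it can help to add a second sort key to break ties. The sketch below, on hypothetical data, sorts by `amount` descending and then by `name` ascending, and also shows `nlargest` as a shortcut for the top rows.

```python
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({
    "name": ["Sam", "Benjamin", "Ari", "Martina"],
    "amount": [1, 4960, 1, 3563],
    "year": [2014, 2013, 2011, 2014],
})

# One ascending flag per sort key; reset_index renumbers the rows
ranked = df.sort_values(
    ["amount", "name"], ascending=[False, True]
).reset_index(drop=True)

print(ranked["name"].tolist())  # ['Benjamin', 'Martina', 'Ari', 'Sam']

# nlargest returns the top rows directly
top2 = df.nlargest(2, "amount")
print(top2["name"].tolist())    # ['Benjamin', 'Martina']
```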


## Pandas groupby and plot

**TODO:** Group the number of names by `year` and plot it using vertical bars

In [15]:
# This one works
# Complete this cell with your code
df_names.groupby('name')['amount'].sum().sort_values(ascending=False)

name,amount
Benjamin,19491
Martina,14068
Isabella,13245
Bautista,12618
Catalina,12376
...,...
Úrsula Indira,1
Úrsula Julieta,1
Úrsula Lara,1
Úrsula Leonor,1


In [18]:
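# This one does not work: 'year' is passed as a second positional argument
# instead of being part of a list of keys
# Complete this cell with your code
df_names.groupby('name','year')['amount'].sum().sort_values(ascending=False)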

ValueError: No axis named year for object type DataFrame
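The error message suggests that `'year'` was interpreted as the `axis` argument rather than as a second grouping key. To group by several columns, the keys go in a single list. The sketch below shows this on a hypothetical four-row frame, and also groups by `year` alone, which is what the TODO asks for.

```python
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({
    "name": ["Benjamin", "Sofia", "Benjamin", "Sofia"],
    "amount": [2986, 2252, 4960, 1800],
    "year": [2010, 2010, 2013, 2013],
})

# Several grouping keys are passed as one list
per_name_year = df.groupby(["name", "year"])["amount"].sum()
print(per_name_year)

# Total amount per year
per_year = df.groupby("year")["amount"].sum()
print(per_year.to_dict())  # {2010: 5238, 2013: 6760}
```

Calling `per_year.plot.bar()` (or `per_year.plot(kind="bar")`) on the result draws the vertical bars, assuming matplotlib is installed.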