<img src="img/Marca-ITBA-Color-ALTA.png" width="200">

# Programming for Data Analysis

## Class 3 - Analyzing a dataset

#### References and further reading:

- Python for Data Analysis by Wes McKinney (O’Reilly) 2018 - chapter 5

- https://pandas.pydata.org/

In [1]:
import numpy as np
import pandas as pd

### Importing a dataset with pandas

In [2]:
df = pd.read_csv('data/US Presidential Election Results - ResultsByCandidate.csv')

In [3]:
type(df)

pandas.core.frame.DataFrame

### First inspection

In [4]:
df.head()
#df.sample(3)
#df.tail()

Unnamed: 0,ElectionYear,CandidateName,HomeState,Incumbent?,CandParty,CandPartyAbbrev,PopularVote,PopVoteShare,ElectoralVotes,ElecVoteShare
0,1788,George Washington,Virginia,N,Independent,I,39624,100.00%,69,100.00%
1,1788,John Adams,Massachusetts,N,Federalist,F,0,0.00%,34,49.28%
2,1788,John Jay,New York,N,Federalist,F,0,0.00%,9,13.04%
3,1788,Robert H. Harrison,Maryland,N,Federalist,F,0,0.00%,6,8.70%
4,1788,John Rutledge,South Carolina,N,Federalist,F,0,0.00%,6,8.70%


In [5]:
df.shape

(358, 10)

In [6]:
list(df.columns)

['ElectionYear',
 'CandidateName',
 'HomeState',
 'Incumbent?',
 'CandParty',
 'CandPartyAbbrev',
 'PopularVote',
 'PopVoteShare',
 'ElectoralVotes',
 'ElecVoteShare']

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358 entries, 0 to 357
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ElectionYear     358 non-null    int64 
 1   CandidateName    358 non-null    object
 2   HomeState        358 non-null    object
 3   Incumbent?       358 non-null    object
 4   CandParty        358 non-null    object
 5   CandPartyAbbrev  355 non-null    object
 6   PopularVote      358 non-null    object
 7   PopVoteShare     358 non-null    object
 8   ElectoralVotes   358 non-null    int64 
 9   ElecVoteShare    358 non-null    object
dtypes: int64(2), object(8)
memory usage: 28.1+ KB
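
The `info()` output above raises two points worth following up: `CandPartyAbbrev` has 355 non-null values out of 358 rows, and the vote share columns are stored as `object` rather than as numbers. Below is a minimal sketch of how one might check both. It runs on a small hand-built frame named `sample` (a hypothetical stand-in for `df`): the first three rows are copied from `df.head()`, and the fourth row, with a missing abbreviation, is a made-up placeholder rather than a row from the dataset.

```python
import pandas as pd

# Stand-in for df. The first three rows come from df.head();
# the last row is a made-up placeholder with a missing abbreviation.
sample = pd.DataFrame({
    'CandidateName': ['George Washington', 'John Adams', 'John Jay',
                      'Placeholder Candidate'],
    'CandPartyAbbrev': ['I', 'F', 'F', None],
    'ElecVoteShare': ['100.00%', '49.28%', '13.04%', '0.00%'],
    'ElectoralVotes': [69, 34, 9, 0],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows where the party abbreviation is missing
rows_missing_abbrev = sample.loc[sample['CandPartyAbbrev'].isna()]
print(rows_missing_abbrev)

# Columns that are not numeric, i.e. candidates for type conversion
non_numeric_columns = sample.select_dtypes(exclude='number').columns.tolist()
print(non_numeric_columns)
```

Applied to the real `df`, the same three calls should point to the rows behind the 355 count and list the columns still stored as text.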


In [8]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ElectionYear,358.0,1917.396648,68.789731,1788.0,1868.0,1928.0,1976.0,2020.0
ElectoralVotes,358.0,68.47486,121.136001,0.0,0.0,0.0,83.75,525.0


### Exercise

* What time period does the dataset cover?

In [9]:
first_year = df['ElectionYear'].min()
last_year = df['ElectionYear'].max()

print(f"Period: {first_year} - {last_year}")

Period: 1788 - 2020


### Exercise

* Which party appears most often in the dataset?

* Select the parties that appear more than 20 times

In [10]:
df['CandParty'].unique()

array(['Independent', 'Federalist', 'Anti-Federalist',
       'Democratic-Republican', 'Democratic', 'National Republican',
       'Nullifier', 'Anti-Masonic', 'Whig', 'Liberty (1800s)',
       'Free Soil', 'Union', 'Know Nothing', 'Southern Rights',
       'Republican', 'Constitutional Union', 'Liberal Republican',
       'Straight-Out Democrats', 'Prohibition', 'Greenback',
       'American National', 'Equal Rights', 'Union Labor', 'Populist',
       'Socialist Labor', 'National Democratic', 'Social Democratic',
       'Socialist', 'Independence', 'Progressive', 'Farmer-Labor',
       'American (1920)', 'Single Tax', 'Communist', 'American (1924)',
       'Liberty (1900s)', 'Union (1936)', "States' Rights Democratic",
       'Socialist Workers', 'Constitution', 'American Third',
       'Christian Nationalist', 'Conservative', 'American Independent',
       "People's", 'Libertarian', 'America First', 'American', 'US Labor',
       'Citizens', 'Right to Life', 'Peace and Freedom', 'New

In [11]:
party_frequency = df['CandParty'].value_counts(ascending=False)
party_frequency

CandParty
Democratic                  55
Republican                  43
Prohibition                 26
Federalist                  25
Democratic-Republican       23
                            ..
American (1924)              1
Liberty (1900s)              1
Union (1936)                 1
American Third               1
Socialism and Liberation     1
Name: count, Length: 64, dtype: int64

In [12]:
party_frequency.loc[party_frequency > 20]

CandParty
Democratic               55
Republican               43
Prohibition              26
Federalist               25
Democratic-Republican    23
Name: count, dtype: int64
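
The first question can also be answered programmatically instead of by reading the top row of `value_counts()`. The sketch below rebuilds a shortened `party_frequency` Series by hand, with counts copied from the output above, so that it runs on its own. In the notebook, the same two expressions would apply to the real `party_frequency`.

```python
import pandas as pd

# Shortened copy of party_frequency, with counts taken from the output above
party_frequency = pd.Series(
    {
        'Democratic': 55,
        'Republican': 43,
        'Prohibition': 26,
        'Federalist': 25,
        'Democratic-Republican': 23,
        'American Third': 1,
    },
    name='count',
)

# idxmax() returns the index label (the party name) of the largest count
most_frequent_party = party_frequency.idxmax()
print(most_frequent_party)

# Boolean mask: keep only the parties that appear more than 20 times
parties_over_20 = party_frequency[party_frequency > 20].index.tolist()
print(parties_over_20)
```

Note that `idxmax()` returns only the first label when several parties tie for the maximum.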

### Exercise

* Build a dataframe with the candidate, party and electoral votes columns, using the year as the index.
* Which candidate obtained the largest absolute number of electoral votes in history?
* In which year?

In [13]:
df.columns

Index(['ElectionYear', 'CandidateName', 'HomeState', 'Incumbent?', 'CandParty',
       'CandPartyAbbrev', 'PopularVote', 'PopVoteShare', 'ElectoralVotes',
       'ElecVoteShare'],
      dtype='object')

In [14]:
df_cut = df[['CandidateName', 'CandParty', 'ElectoralVotes', 'ElectionYear']]

df_cut = df_cut.set_index('ElectionYear', drop=True)

df_cut

ElectionYear,CandidateName,CandParty,ElectoralVotes
1788,George Washington,Independent,69
1788,John Adams,Federalist,34
1788,John Jay,Federalist,9
1788,Robert H. Harrison,Federalist,6
1788,John Rutledge,Federalist,6
...,...,...,...
2016,Gloria La Riva,Socialism and Liberation,0
2020,Joseph R. Biden Jr.,Democratic,306
2020,Donald J. Trump,Republican,232
2020,Jo Jorgensen,Libertarian,0


In [15]:
df_cut.loc[df_cut['ElectoralVotes'] == df_cut['ElectoralVotes'].max()]

ElectionYear,CandidateName,CandParty,ElectoralVotes
1984,Ronald W. Reagan,Republican,525
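
Because the year is the index, the question "in which year?" can also be answered with `idxmax()`, which returns the index label of the maximum. The sketch below uses a four-row stand-in named `df_cut_sample` (a hypothetical name), with rows copied from the outputs in this notebook, so that it runs independently of the CSV file.

```python
import pandas as pd

# Four rows taken from the notebook outputs, indexed by year like df_cut
df_cut_sample = pd.DataFrame(
    {
        'CandidateName': ['Franklin D. Roosevelt', 'Ronald W. Reagan',
                          'Joseph R. Biden Jr.', 'Donald J. Trump'],
        'CandParty': ['Democratic', 'Republican', 'Democratic', 'Republican'],
        'ElectoralVotes': [523, 525, 306, 232],
    },
    index=pd.Index([1936, 1984, 2020, 2020], name='ElectionYear'),
)

# Index label (year) of the row with the most electoral votes
best_year = df_cut_sample['ElectoralVotes'].idxmax()

# Full row for that maximum: sort descending and take the first row
best_row = df_cut_sample.sort_values('ElectoralVotes', ascending=False).iloc[0]

print(best_year, best_row['CandidateName'], best_row['ElectoralVotes'])
```

One difference from the boolean mask used in the cell above: the mask keeps every row that ties for the maximum, while `idxmax()` reports only the first one.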


### Exercise

* Starting from the dataframe built in the previous exercise, create a new one containing only the rows for Democratic candidates.

* Sort them by number of electoral votes in descending order. (Do this per election: if a candidate ran in more than one election, treat each run as a separate candidate.)

In [16]:
mask_democ = df_cut['CandParty'] == 'Democratic'

df_democ = df_cut.loc[mask_democ]

df_democ.sort_values(by=['ElectoralVotes'], ascending=False)

ElectionYear,CandidateName,CandParty,ElectoralVotes
1936,Franklin D. Roosevelt,Democratic,523
1964,Lyndon B. Johnson,Democratic,486
1932,Franklin D. Roosevelt,Democratic,472
1940,Franklin D. Roosevelt,Democratic,449
1912,Woodrow Wilson,Democratic,435
1944,Franklin D. Roosevelt,Democratic,432
1996,Bill Clinton,Democratic,379
1992,Bill Clinton,Democratic,370
2008,Barack H. Obama II,Democratic,365
2012,Barack H. Obama II,Democratic,332


### Exercise

* Create a new column in the original dataset containing the electoral vote share as a number
* Compute the mean and standard deviation of the electoral vote share obtained by Democrats and by Republicans throughout history

In [17]:
df.head()

Unnamed: 0,ElectionYear,CandidateName,HomeState,Incumbent?,CandParty,CandPartyAbbrev,PopularVote,PopVoteShare,ElectoralVotes,ElecVoteShare
0,1788,George Washington,Virginia,N,Independent,I,39624,100.00%,69,100.00%
1,1788,John Adams,Massachusetts,N,Federalist,F,0,0.00%,34,49.28%
2,1788,John Jay,New York,N,Federalist,F,0,0.00%,9,13.04%
3,1788,Robert H. Harrison,Maryland,N,Federalist,F,0,0.00%,6,8.70%
4,1788,John Rutledge,South Carolina,N,Federalist,F,0,0.00%,6,8.70%


In [18]:
df['ElecVoteShare_num'] = df['ElecVoteShare'].str.replace('%', '')

df['ElecVoteShare_num'] = df['ElecVoteShare_num'].astype(float)
df['ElecVoteShare_num']

0      100.00
1       49.28
2       13.04
3        8.70
4        8.70
        ...  
353      0.00
354     56.88
355     43.12
356      0.00
357      0.00
Name: ElecVoteShare_num, Length: 358, dtype: float64
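
An alternative for the conversion step, sketched under the assumption that every value ends in a single `%` sign: strip the sign with `str.rstrip` and parse with `pd.to_numeric`. With `errors='coerce'`, any value that cannot be parsed becomes `NaN`, whereas `astype(float)` raises an error. The four strings below are the first values of `ElecVoteShare` shown in `df.head()`.

```python
import pandas as pd

# First four ElecVoteShare values from df.head()
shares = pd.Series(['100.00%', '49.28%', '13.04%', '8.70%'], name='ElecVoteShare')

# Remove the trailing '%' and parse as numbers;
# unparsable strings would become NaN instead of raising
shares_num = pd.to_numeric(shares.str.rstrip('%'), errors='coerce')
print(shares_num)

# Count how many values failed to parse (0 here)
n_failed = int(shares_num.isna().sum())
print(n_failed)
```

Checking `n_failed` after the conversion is a cheap way to confirm that no malformed strings were silently turned into missing values.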

In [19]:
mask_democ = df['CandParty'] == 'Democratic'

mn_democ = df.loc[mask_democ, 'ElecVoteShare_num'].mean()
sd_democ = df.loc[mask_democ, 'ElecVoteShare_num'].std()


mask_repub = df['CandParty'] == 'Republican'

mn_repub = df.loc[mask_repub, 'ElecVoteShare_num'].mean()
sd_repub = df.loc[mask_repub, 'ElecVoteShare_num'].std()


print(f"Democrats share: {np.round(mn_democ, 2)} +/- {np.round(sd_democ, 2)}")
print(f"Republicans share: {np.round(mn_repub, 2)} +/- {np.round(sd_repub, 2)}")

Democrats share: 42.12 +/- 27.92
Republicans share: 51.93 +/- 26.98
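
The per-party mean and standard deviation can also be computed in one step with `groupby` and `agg`, which avoids repeating the mask for each party. The sketch below uses a toy frame named `toy`: its share values are illustrative numbers chosen to make the arithmetic easy to follow, not values from the dataset.

```python
import pandas as pd

# Toy data: the shares below are illustrative, NOT taken from the dataset
toy = pd.DataFrame({
    'CandParty': ['Democratic', 'Democratic', 'Republican', 'Republican',
                  'Libertarian'],
    'ElecVoteShare_num': [50.0, 60.0, 40.0, 50.0, 0.0],
})

# Keep only the two parties of interest, then aggregate per party
mask_parties = toy['CandParty'].isin(['Democratic', 'Republican'])
stats = (
    toy.loc[mask_parties]
       .groupby('CandParty')['ElecVoteShare_num']
       .agg(['mean', 'std'])
)
print(stats.round(2))
```

Like `Series.std()` in the cell above, the `'std'` aggregation computes the sample standard deviation (`ddof=1`), so on the real `df` both approaches should give the same numbers.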
