# Methods we will use in the exam

* Groupby
* Pivot
* Merge
* Append
* Map and Apply
* iloc

## Exercise 1

We are asked to check whether a relationship can be established between the country and the demand for certain products. The idea is that, knowing the country, we can predict the quantities of its top three products (their price and the number of units sold), grouped by month.

* 1. Sampling unit: country.
* 2. Quantity is the auxiliary variable used to compute demand.
* 3. The second auxiliary variable is price.
* 4. All of this for the 3 best-selling products in each country.
* 5. Grouped at the month level.

In [1]:
"""Import the data"""
import pandas as pd
import os

pd.set_option('display.max_columns', 500)

"""os.getcwd() returns the current working directory; we strip the trailing 'notebooks' folder name from it"""
data_path = os.getcwd()[:-len('notebooks')] + 'Data/'
"""Once inside the Data folder, we point to the file"""
transactional_data = data_path + 'Business Sales Transaction.csv'

"""Read the file with pandas"""
sales_ecommerce = pd.read_csv(transactional_data, low_memory = False)
sales_ecommerce.head()

Unnamed: 0,TransactionNo,Date,ProductNo,ProductName,Price,Quantity,CustomerNo,Country
0,536365,2018-12-01,85123A,Cream Hanging Heart T-Light Holder,1.88,6,17850.0,United Kingdom
1,536365,2018-12-01,71053,White Moroccan Metal Lantern,2.01,6,17850.0,United Kingdom
2,536365,2018-12-01,84406B,Cream Cupid Hearts Coat Hanger,1.91,8,17850.0,United Kingdom
3,536365,2018-12-01,84029G,Knitted Union Flag Hot Water Bottle,2.01,6,17850.0,United Kingdom
4,536365,2018-12-01,84029E,Red Woolly Hottie White Heart,2.01,6,17850.0,United Kingdom


In [2]:
"""Copy the dataframe into a new one"""
df = sales_ecommerce.copy()
"""Create a new column called key holding the first 7 characters of
the date (the year and month, e.g. '2018-12'). Python indexes from 0,
so [:7] covers positions 0 through 6"""
"""The map method applies a function to every element of a pandas
Series in a single line. Here x.strip()[:7] first removes the leading
and trailing whitespace from each value of Date and then keeps the
first 7 characters of the stripped string"""
df["key"]=df['Date'].map(lambda x:x.strip()[:7])
df.head(3)

Unnamed: 0,TransactionNo,Date,ProductNo,ProductName,Price,Quantity,CustomerNo,Country,key
0,536365,2018-12-01,85123A,Cream Hanging Heart T-Light Holder,1.88,6,17850.0,United Kingdom,2018-12
1,536365,2018-12-01,71053,White Moroccan Metal Lantern,2.01,6,17850.0,United Kingdom,2018-12
2,536365,2018-12-01,84406B,Cream Cupid Hearts Coat Hanger,1.91,8,17850.0,United Kingdom,2018-12


Note: strip() removes leading and trailing whitespace. It is advisable to call it before slicing or iterating over strings, so that stray spaces do not shift the character positions and cause errors.
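
As a minimal sketch of this step, the toy Series below (values made up for illustration, not taken from the dataset) shows that `map` with a lambda and the vectorized `.str` accessor produce the same year-month key:

```python
import pandas as pd

# Toy dates with stray whitespace (made-up values)
dates = pd.Series([' 2018-12-01', '2019-01-15 ', '2019-10-03'])

# Same approach as the notebook: strip each string, keep the first 7 characters
key_map = dates.map(lambda x: x.strip()[:7])

# Vectorized alternative using the .str accessor
key_str = dates.str.strip().str[:7]

print(key_map.tolist())
print(key_str.tolist())
```

Without the `strip()` call, the first value would yield `' 2018-1'`, which is why stripping before slicing matters.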

In [3]:
"""Here we take a subset of the dataframe with only the Country and
key columns, group by Country (which becomes the index) and count how
many key values each country has"""
df[['Country', 'key']].groupby('Country').count().head(3)

Unnamed: 0_level_0,key
Country,Unnamed: 1_level_1
Australia,1704
Austria,887
Bahrain,17
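
A small sketch of what `count` does after a `groupby`, using made-up rows: it counts the non-null values in each group, whereas `size` counts rows. The two only differ when the counted column has missing values.

```python
import pandas as pd

# Toy data with one missing key (made-up values)
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B'],
    'key': ['2019-01', None, '2019-02'],
})

non_null = toy.groupby('Country')['key'].count()  # non-null keys per country
n_rows = toy.groupby('Country').size()            # rows per country

print(non_null.to_dict())
print(n_rows.to_dict())
```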


In [4]:
"""The previous instruction returns a dataframe; we take its index
(the countries) and use it as the index of our new final dataframe"""
dffinal = pd.DataFrame(index = df[['Country', 'key']].groupby('Country').count().index)
dffinal.head(3)

Australia
Austria
Bahrain


In [5]:
"""Next we filter the information with a groupby. Starting from a
sub-dataframe with only Country, key, ProductNo, Quantity and Price,
we group first by country, then by key (the year-month) and then by
ProductNo. We then apply the aggregation functions, written as
column: function, so Quantity is summed and Price is averaged"""
# Feature selection
aux = (df[['Country', 'key', 'ProductNo', 'Quantity', 'Price']] # This line selects the columns
       .groupby(['Country', 'key', 'ProductNo']).agg({'Quantity': 'sum', 'Price': 'mean'}))
aux

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Quantity,Price
Country,key,ProductNo,Unnamed: 3_level_1,Unnamed: 4_level_1
Australia,2018-12,20685,4,2.69
Australia,2018-12,20725,10,1.75
Australia,2018-12,21217,-1,2.99
Australia,2018-12,21622,8,2.24
Australia,2018-12,21791,12,1.69
...,...,...,...,...
Unspecified,2019-09,84378,2,1.72
Unspecified,2019-09,84991,2,1.58
Unspecified,2019-09,84997C,2,2.12
Unspecified,2019-09,85019B,2,1.69
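
The aggregation can be checked by hand on a toy frame. This sketch uses made-up rows and a hypothetical name, `toy_aux`, to show how a dictionary passed to `agg` applies a different function to each column:

```python
import pandas as pd

# Toy sales rows (made-up values)
toy = pd.DataFrame({
    'Country':   ['A', 'A', 'A', 'B'],
    'key':       ['2019-01', '2019-01', '2019-01', '2019-01'],
    'ProductNo': ['p1', 'p1', 'p2', 'p1'],
    'Quantity':  [2, 3, 5, 1],
    'Price':     [1.0, 2.0, 4.0, 3.0],
})

# One row per (Country, key, ProductNo): total quantity and average price
toy_aux = (toy.groupby(['Country', 'key', 'ProductNo'])
              .agg({'Quantity': 'sum', 'Price': 'mean'}))
print(toy_aux)
```

The two `p1` rows for country `A` collapse into a single row with quantity 5 and price 1.5, and the grouping columns become a three-level index.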


In [6]:
"""Now comes the extraction step, where we use dictionaries to fill
the dataframe. We first build keys, a list of every month, using list
comprehensions, and create an empty list rows that is filled in the
for loop below"""

keys = ['2018-12'] + [f'2019-0{i}' for i in range(1,10,1)] + [f'2019-{i}' for i in range(10,13,1)]
rows = []

for f in dffinal.index: # Iterate over the final index, i.e. the countries
    l = [] # Empty list that will hold one list of tuples per month
    for k in keys: # Iterate over the months so every country gets every date
        try: # A (country, month) pair with no sales raises KeyError below
            """Using the aux dataframe built above, we use loc to filter
            the data: since it has a multi-index, we ask for the rows with
            country f and month k. We sort them by Quantity in descending
            order and keep the best sellers with head(3), i.e. the top 3.
            aux_ holds the quantities and aux_2 holds the prices.
            The result is stored as a list of tuples whose elements are
            the product number, the quantity and the price; .get returns
            the value for a given dictionary key"""
            aux_ = dict(aux.loc[(f, k)].sort_values(by=('Quantity'), ascending = False).head(3)['Quantity'])
            aux_2 = dict(aux.loc[(f, k)].sort_values(by=('Quantity'), ascending = False).head(3)['Price'])
            l.append([(i, aux_.get(i), aux_2.get(i)) for i in aux_])
        except KeyError: # No sales for this country and month: fill with zeros
            l.append([(0,0,0) for i in range(3)])
    rows.append(l)
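
As an alternative to the nested loop (a sketch on made-up data, not the approach used in this notebook), the top three products per country and month can also be taken by sorting once and calling `head(3)` on each group:

```python
import pandas as pd

# Toy aggregated sales, already one row per product (made-up values)
toy_aux = pd.DataFrame({
    'Country':   ['A', 'A', 'A', 'A', 'B'],
    'key':       ['2019-01'] * 5,
    'ProductNo': ['p1', 'p2', 'p3', 'p4', 'p1'],
    'Quantity':  [10, 40, 30, 20, 7],
    'Price':     [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Sort by quantity, then keep the first 3 rows of every (Country, key) group
top3 = (toy_aux.sort_values('Quantity', ascending=False)
               .groupby(['Country', 'key'])
               .head(3))
print(top3)
```

Unlike the loop, this version does not add `(0, 0, 0)` placeholders: a country and month with no sales simply has no rows, and a group with fewer than three products returns fewer rows.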

In [7]:
# Example
prueba = dict(aux.loc[("Australia","2018-12")].sort_values(by = "Quantity", ascending = False).head(3)["Price"])
prueba

{'22915': 1.55, '79067': 1.94, '22196': 1.63}

In [8]:
lista = []
lista.append([(i,prueba.get(i)) for i in prueba])
lista

[[('22915', 1.55), ('79067', 1.94), ('22196', 1.63)]]

In [9]:
for i in prueba:
    print(i)

22915
79067
22196


In [10]:
rows[:2]

[[[('22915', 120, 1.55), ('79067', 50, 1.94), ('22196', 48, 1.63)],
  [('22492', 576, 1.58),
   ('21915', 252, 1.6749999999999998),
   ('22720', 243, 2.1900000000000004)],
  [('22969', 480, 1.69), ('20973', 384, 1.58), ('22962', 384, 1.61)],
  [('22615', 432, 1.54), ('21984', 432, 1.54), ('21981', 432, 1.54)],
  [('20725', 30, 1.75), ('22662', 30, 1.75), ('22384', 30, 1.75)],
  [('15036', 600, 1.61), ('21902', 576, 1.58), ('21900', 576, 1.58)],
  [('22492', 576, 1.58), ('16161P', 400, 1.55), ('22704', 400, 1.55)],
  [('23295', 408, 1.6175000000000002),
   ('23293', 408, 1.6175000000000002),
   ('23296', 408, 1.6824999999999999)],
  [('21731', 720, 1.72), ('22492', 576, 1.58), ('22940', 336, 2.06)],
  [('22492', 1152, 1.58),
   ('21915', 732, 1.6749999999999998),
   ('22751', 240, 2.01)],
  [('22722', 360, 2.02), ('23507', 250, 1.55), ('23510', 250, 1.55)],
  [('23084', 1632, 1.77), ('23247', 216, 1.93), ('23234', 216, 1.5)],
  [(0, 0, 0), (0, 0, 0), (0, 0, 0)]],
 [[('22153', -48, 1.56)

Note: iterating over a dictionary yields only its keys; to obtain the value for a key, use .get.
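
A short sketch of the same idea, reusing the three prices printed in the example cell. It also shows `.items()`, which yields key and value together, and that `.get` returns `None` for a missing key instead of raising an error:

```python
# Product number -> price, as printed in the example cell
prices = {'22915': 1.55, '79067': 1.94, '22196': 1.63}

keys_only = [k for k in prices]               # iterating yields keys only
pairs = [(k, prices.get(k)) for k in prices]  # look up each value with .get
pairs_items = list(prices.items())            # keys and values in one step

print(keys_only)
print(pairs == pairs_items)
print(prices.get('00000'))  # missing key
```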

In [11]:
"""Now we simply fill the dataframe with the data obtained. rows holds
the lists of tuples by country and month, which is why it is a list
nested inside two outer lists"""
dff = pd.DataFrame(rows, index = dffinal.index, columns = keys)
dff

Unnamed: 0_level_0,2018-12,2019-01,2019-02,2019-03,2019-04,2019-05,2019-06,2019-07,2019-08,2019-09,2019-10,2019-11,2019-12
Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Australia,"[(22915, 120, 1.55), (79067, 50, 1.94), (22196...","[(22492, 576, 1.58), (21915, 252, 1.6749999999...","[(22969, 480, 1.69), (20973, 384, 1.58), (2296...","[(22615, 432, 1.54), (21984, 432, 1.54), (2198...","[(20725, 30, 1.75), (22662, 30, 1.75), (22384,...","[(15036, 600, 1.61), (21902, 576, 1.58), (2190...","[(22492, 576, 1.58), (16161P, 400, 1.55), (227...","[(23295, 408, 1.6175000000000002), (23293, 408...","[(21731, 720, 1.72), (22492, 576, 1.58), (2294...","[(22492, 1152, 1.58), (21915, 732, 1.674999999...","[(22722, 360, 2.02), (23507, 250, 1.55), (2351...","[(23084, 1632, 1.77), (23247, 216, 1.93), (232...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]"
Austria,"[(22153, -48, 1.56)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(21918, 288, 1.55), (22546, 240, 1.55), (2258...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(22492, 36, 1.6), (22693, 24, 1.69), (21723, ...","[(22697, -1, 1.94), (22357, -5, 2.14)]","[(21500, 25, 1.56), (23545, 25, 1.56), (21498,...","[(21402, 24, 1.52), (21403, 24, 1.52), (22814,...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(72817, 24, 1.62), (21788, 24, 1.63), (21789,...","[(23084, 279, 2.12), (22154, 49, 1.59), (23353...","[(22646, 36, 1.56), (15056N, 24, 2.39), (20679..."
Bahrain,"[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(23076, 96, 1.69), (23077, 60, 1.69), (22693,...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]"
Belgium,"[(21212, 148, 1.61), (22417, 144, 1.57), (2197...","[(22489, 144, 1.55), (22540, 48, 1.56), (22544...","[(22740, 48, 1.63), (22540, 48, 1.56), (22531,...","[(20724, 110, 1.62), (22355, 100, 1.61), (2066...","[(21212, 72, 1.58), (21977, 72, 1.58), (21829,...","[(84991, 96, 1.58), (22951, 96, 1.58), (21976,...","[(22492, 108, 1.6), (22740, 96, 1.63), (20724,...","[(84536B, 288, 1.5550000000000002), (22027, 72...","[(23546, 50, 1.56), (23545, 50, 1.56), (21981,...","[(23310, 180, 1.55), (22326, 90, 1.93), (22328...","[(22951, 96, 1.58), (21212, 96, 1.58), (22417,...","[(22629, 112, 1.782), (23084, 108, 1.79), (223...","[(21499, 50, 1.56), (79190B, 27, 1.595), (1615..."
Brazil,"[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(22630, 24, 1.79), (22697, 24, 1.88), (22993,...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]"
Canada,"[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(84755, 16, 1.6), (20886, 12, 1.79), (71459, ...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(37370, 504, 1.66)]","[(84077, 288, 1.53), (20727, 50, 1.75), (10133...","[(47593B, 48, 1.56), (22098, 36, 1.56), (22099...","[(23293, 16, 1.62), (21993, 12, 1.69), (23294,...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]"
Channel Islands,"[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(16161P, 25, 1.56), (21498, 25, 1.56), (21500...","[(21785, 407, 1.6), (22720, 96, 2.14), (22745,...","[(22740, 96, 1.63), (22741, 48, 1.63), (22151,...","[(22908, 24, 1.63), (23206, 20, 1.75), (22666,...","[(21499, 25, 1.56), (21500, 25, 1.56), (21498,...","[(22985, 25, 1.56), (21212, 24, 1.58), (21977,...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(85099B, 210, 1.79), (85099C, 200, 1.77), (23...","[(22741, 96, 1.63), (22197, 48, 1.63), (22492,...","[(85123A, 72, 1.94), (23327, 48, 1.59), (22952...","[(22771, 36, 1.69), (84992, 24, 1.58), (84991,...","[(23503, 24, 1.69), (23355, 20, 2.24), (22086,..."
Cyprus,"[(22335, 96, 1.6), (85123A, 64, 1.88), (22297,...","[(82613A, 60, 1.5599999999999998), (82484, 12,...","[(23230, 25, 1.56), (22709, 25, 1.56), (22616,...","[(84598, 288, 1.53), (84568, 288, 1.53), (8519...","[(22666, -2, 1.94), (22839, -2, 3.74)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(23182, 24, 1.62), (22993, 24, 1.69), (22908,...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(85204, 96, 1.52), (22066, 48, 1.56), (22153,...","[(16238, 28, 1.53), (22818, 24, 1.56), (23500,...","[(22720, -1, 2.24), (22826, -1, 7.88), (22797,..."
Czech Republic,"[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(84755, 48, 1.6), (22231, 36, 1.72), (22250, ...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(22231, -15, 1.72), (84459A, -24, 1.72)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(22578, 72, 1.54), (22579, 72, 1.54), (23271,...","[(22231, -15, 1.72), (84459A, -24, 1.72)]","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]"
Denmark,"[(22951, 120, 1.56), (22956, 48, 1.78), (22779...","[(0, 0, 0), (0, 0, 0), (0, 0, 0)]","[(22538, 24, 1.56), (22539, 24, 1.56), (22544,...","[(22467, 144, 1.81), (22693, 144, 1.66), (2217...","[(21212, 24, 1.58), (21977, 24, 1.58), (23076,...","[(22630, 36, 1.79), (22629, 36, 1.79), (22556,...","[(22383, 125, 1.7349999999999999), (84077, 96,...","[(22820, 24, 1.6), (23290, 24, 1.69), (23289, ...","[(16161P, 25, 1.56), (22704, 25, 1.56), (23546...","[(21915, 240, 1.66), (20713, 100, 1.77), (2320...","[(23296, 256, 1.66), (23295, 256, 1.61), (2329...","[(23186, 48, 1.54), (21175, 48, 1.81), (82582,...","[(16237, 60, 1.53), (22045, 25, 1.56), (22708,..."


## Exercise 2

As the Olympic committee, we are asked to analyze the relationship between the number of participants sent by each country and the number of medals won at each of the three levels.

We want to analyze this at the country level. We would like to see the number of athletes sent by each country (total, women and men), the total number of medals won, and the totals of gold, silver and bronze medals (split by men and women). We also want to count the number of coaches each country sent, likewise split by women and men. Finally, we add the columns for the referees. The aim is to analyze whether the number of people sent by each country is related to the number of medals it brought home.

In [12]:
"""Import the data for the Beijing games, which lives in a folder
inside Data"""
data_path = os.getcwd()[:-len('notebooks')] + 'Data/'
db_beijing_path = data_path + 'Beijing_winter_Olympic_Games/'

We need to define the tables we will use: medals, athletes, coaches and referees.

In [13]:
"""Import each of the tables described above"""
athletes = pd.read_csv(db_beijing_path + 'athletes.csv')
officials = pd.read_csv(db_beijing_path + 'technical_officials.csv')
coaches = pd.read_csv(db_beijing_path + 'coaches.csv')
medals = pd.read_csv(db_beijing_path + 'medals.csv')

First we create the main column of our TAD table: the countries. Note that we build it from the countries of every table combined, since some countries may not have sent athletes, referees or coaches, and some countries won no medals.

In [14]:
"""Build a list from the set of the per-table country lists, each
deduplicated with unique()"""

# Build the list of all countries
countries = list(set(list(athletes['country'].unique())
                     + list(officials['country'].unique())
                     + list(coaches['country'].unique())))
# Medals is left out because its countries are a subset of the athletes' countries
"""Now convert that list into a pandas Series and sort it
alphabetically"""
countries = pd.Series(countries).sort_values()
countries

83                     Albania
4               American Samoa
70                     Andorra
31                   Argentina
15                     Armenia
                ...           
89                      Turkey
85                     Ukraine
57    United States of America
32                  Uzbekistan
8           Virgin Islands, US
Length: 92, dtype: object
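
The same union can be sketched on toy frames (hypothetical names and made-up countries) with set operations, which remove duplicates across tables as well as within each one:

```python
import pandas as pd

# Toy stand-ins for the three tables (made-up values)
athletes_toy = pd.DataFrame({'country': ['Norway', 'Chile', 'Norway']})
officials_toy = pd.DataFrame({'country': ['Japan', 'Chile']})
coaches_toy = pd.DataFrame({'country': ['Canada']})

# Union of the three country sets, sorted alphabetically
all_countries = sorted(set(athletes_toy['country'])
                       | set(officials_toy['country'])
                       | set(coaches_toy['country']))
countries_toy = pd.Series(all_countries)
print(countries_toy.tolist())
```

Sorting the list before building the Series gives a clean 0 to n-1 index, whereas `sort_values()` on an already built Series keeps the original, shuffled index labels seen in the output above.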

We now analyze the first table, athletes, since it is the largest.

In [15]:
"""Check which distinct values the gender column takes, so that
inconsistent labels do not affect our counts"""
athletes['gender'].unique()

array(['Male', 'Female', 'F', 'M'], dtype=object)

In [16]:
"""The genders are Male, Female, F and M, so we need to normalize the
variable by assigning F and M to Female and Male. We do this by
mapping the values in the dataframe"""

aux_athletes_dict = {
    'Male' : 'Male',
    'Female': 'Female',
    'F': 'Female',
    'M': 'Male'
} # Dictionary that reassigns the values; we use it for the mapping

"""Now replace the values of the gender column, using map to perform
the substitution"""
athletes['gender'] = athletes['gender'].map(aux_athletes_dict)
athletes['gender'].unique()

array(['Male', 'Female'], dtype=object)

Note: map works with a dictionary in which each key is the value to look up and each value is its replacement.
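
One detail worth keeping in mind, sketched here on a made-up Series: any value that is not a key of the dictionary becomes `NaN` after `map`, which is why the dictionary above also lists `Male` and `Female` mapped to themselves.

```python
import pandas as pd

# Toy gender column with one unexpected label 'X' (made-up values)
gender = pd.Series(['Male', 'F', 'M', 'Female', 'X'])
mapping = {'Male': 'Male', 'Female': 'Female', 'F': 'Female', 'M': 'Male'}

normalized = gender.map(mapping)
print(normalized.tolist())
print(normalized.isna().sum())  # number of labels the dictionary did not cover
```

Checking `isna().sum()` after the mapping is a quick way to confirm that no label was missed.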

Next we check whether this inconsistency in the genders also appears in the other tables:

In [17]:
# Check which distinct values the variable takes
officials['gender'].unique()

array(['Female', 'Male'], dtype=object)

In [18]:
# Check which distinct values the variable takes
coaches['gender'].unique()

array(['Male', 'Female'], dtype=object)

In [19]:
# Check which distinct values the variable takes
medals['athlete_sex'].unique()

array(['X', 'W', 'M', 'O'], dtype=object)

**For this particular case we need an alternative way to obtain the athlete's sex. We will use dictionaries built from the athletes dataframe.**

In [20]:
# Copy the medals dataframe into a new one, where we will look up the athletes' sex

medals_ = medals.copy()

In [21]:
"""Now we create the new column in medals_ that holds the gender of
each medal winner"""

dictionaries = dict()
vars_ = ['name', 'gender', 'country'] # Variables of interest
vars_f = ['name', 'gender'] # Key and value for the dictionary
f_var = 'country' # Filtering variable

"""In this for loop we update the dictionary once per country.
From the athletes table we take the subset with only the vars_
columns and use loc to keep the athletes from country c. From those
rows we take the name and gender values"""

for c in countries:
    dictionaries.update(dict(athletes[vars_].loc[athletes[f_var] == c][vars_f].values))
dictionaries

{'XHEPA Denni': 'Male',
 'CRUMPTON Nathan': 'Male',
 'ESTEVE ALTIMIRAS Ireneu': 'Male',
 'ESTEVEZ Maeva': 'Female',
 'MORENO Cande': 'Female',
 'VERDU Joan': 'Male',
 'VILA OBIOLS Carola': 'Female',
 'BARUZZI FARRIOL Francesca': 'Female',
 'BIRKNER de MIGUEL Tomas': 'Male',
 'DAL FARRA Franco': 'Male',
 'DIAZ GONZALEZ Nahiara': 'Female',
 'RAVENNA Veronica Maria': 'Female',
 'RODRIGUEZ LOPEZ Maria Victoria': 'Female',
 'GALSTYAN Katya': 'Female',
 'GARABEDIAN Tina': 'Female',
 'HARUTYUNYAN Harutyun': 'Male',
 'MIKAYELYAN Mikayel': 'Male',
 'MURADYAN Angelina': 'Female',
 'PROULX SENECAL Simon': 'Male',
 'ANTHONY Jakara': 'Female',
 'ARTHUR Emily': 'Female',
 'ASH Gabi': 'Female',
 'ASH Sophie': 'Female',
 'BAFF Josie': 'Female',
 'BELLINGHAM Phil': 'Male',
 'BOLTON Cameron': 'Male',
 'BROCKHOFF Belle': 'Female',
 'COADY Tess': 'Female',
 'COREY Brendan': 'Male',
 'COX Britteny': 'Female',
 'COX Matthew': 'Male',
 'CRAINE Kailani': 'Female',
 'de CAMPO Seve': 'Male',
 'DICKSON Adam': 'M

In [22]:
"""Now that we have the dictionary of names and genders, we map the
medal winners into a new column, medals_["gender"]
"""
medals_['gender']=medals_['athlete_name'].map(dictionaries)
medals_['gender'].unique()

array(['Female', 'Male'], dtype=object)
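
The lookup can be sketched end to end on toy tables (hypothetical names and made-up athletes). Calling `dict()` on a two-column array of `[name, gender]` pairs builds the dictionary, and `map` then attaches the gender to each medal row:

```python
import pandas as pd

# Toy stand-ins for the athletes and medals tables (made-up values)
athletes_toy = pd.DataFrame({
    'name': ['DOE Jane', 'ROE Rick'],
    'gender': ['Female', 'Male'],
    'country': ['A', 'B'],
})
medals_toy = pd.DataFrame({'athlete_name': ['ROE Rick', 'DOE Jane', 'NOBODY Sam']})

# Each row [name, gender] becomes one key-value pair
name_to_gender = dict(athletes_toy[['name', 'gender']].values)

medals_toy['gender'] = medals_toy['athlete_name'].map(name_to_gender)
print(medals_toy)
```

A medal whose `athlete_name` is absent from the athletes table (here `NOBODY Sam`) ends up with a missing gender, so it is worth checking `unique()` afterwards, as the cell above does. Two athletes sharing a name would also collide in the dictionary, with the later entry overwriting the earlier one.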

In [23]:
"""Since we need to count the male and female athletes per country, we
use a pivot table, which gives the counts in separate columns.
From the athletes dataframe we take only country, gender and name,
drop duplicate rows if there are any, and with pivot_table group by
country and split by gender.
We use a count since the variable is discrete, and pass name as values
"""

# 1. Athletes
aux_athletes = (athletes[['country', 'gender', 'name']]
                .drop_duplicates()
                .pivot_table(index = 'country', columns = 'gender', aggfunc = 'count', fill_value = 0, 
                             values = 'name'))
aux_athletes.head(3)

gender,Female,Male
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Albania,0,1
American Samoa,0,1
Andorra,3,2


In [24]:
"""Now rename the columns: lowercase each name (Female becomes female)
and append _athletes"""

aux_athletes.columns = [c.lower() + '_athletes' for c in aux_athletes.columns]
aux_athletes.head(3)

Unnamed: 0_level_0,female_athletes,male_athletes
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Albania,0,1
American Samoa,0,1
Andorra,3,2
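
The pivot and the renaming can be verified on a toy table (made-up rows; `counts` is a hypothetical name). Note how `fill_value = 0` covers the country with no female athletes:

```python
import pandas as pd

# Toy athletes table (made-up values)
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B'],
    'gender':  ['Female', 'Male', 'Female', 'Male'],
    'name':    ['n1', 'n2', 'n3', 'n4'],
})

# One row per country, one column per gender, counting names
counts = toy.pivot_table(index='country', columns='gender',
                         values='name', aggfunc='count', fill_value=0)

# Lowercase the gender labels and add a suffix
counts.columns = [c.lower() + '_athletes' for c in counts.columns]
print(counts)
```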


In [25]:
"""The same process is repeated for the other tables"""

# 2. Officials
aux_officials = (officials[['country', 'gender', 'name']]
                 .drop_duplicates()
                 .pivot_table(index = 'country', columns = 'gender', aggfunc = 'count', fill_value = 0,
                            values = 'name'))
aux_officials.columns = [c.lower() + '_officials' for c in aux_officials.columns]
# 3. Coaches
aux_coaches = (coaches[['country', 'gender', 'name']]
               .drop_duplicates()
               .pivot_table(index = 'country', columns = 'gender', aggfunc = 'count', fill_value = 0,
                            values = 'name'))
aux_coaches.columns = [c.lower() + '_coaches' for c in aux_coaches.columns]
# 4. Medals
aux_medals = (medals_[['country', 'gender', 'athlete_name']]
                .pivot_table(index = 'country', columns = 'gender', aggfunc = 'count', fill_value = 0,
                            values = 'athlete_name'))
aux_medals.columns = [c.lower() + '_total_medals' for c in aux_medals.columns]

In [26]:
"""For the men's and women's medals the process is the same, except
that we filter by gender before calling pivot_table"""
# 5. Female Medals
aux_female_medals = (medals_[['country', 'gender', 'medal_type']].loc[medals_['gender'] == 'Female']
                     .pivot_table(index = 'country', columns = 'medal_type', aggfunc = 'count',
                                  values = 'gender', fill_value = 0))
aux_female_medals.columns = [f'female_{c.lower()}_medals' for c in aux_female_medals.columns]
# 6. Male Medals
aux_male_medals = (medals_[['country', 'gender', 'medal_type']].loc[medals_['gender'] == 'Male']
                     .pivot_table(index = 'country', columns = 'medal_type', aggfunc = 'count',
                                  values = 'gender', fill_value = 0))
aux_male_medals.columns = [f'male_{c.lower()}_medals' for c in aux_male_medals.columns]

In [27]:
"""For the total number of athletes we do not want anything split into
columns, only a count, so we use a groupby.
We select just the country and name columns, drop duplicate rows,
group by country so that it becomes the index, and count the name
column
"""
aux_total_athletes = pd.DataFrame(athletes[['country', 'name']].drop_duplicates().groupby('country').count()['name'])
aux_total_athletes.columns = ['athletes']

In [28]:
"""Create the base dataframe, add the country column and fill it from
the index, which is the list of countries"""
df = pd.DataFrame(index = countries)
df['country'] = df.index
df.head(3)

Unnamed: 0,country
Albania,Albania
American Samoa,American Samoa
Andorra,Andorra


In [29]:
"""Reset the index, dropping the old values and replacing them with integers"""
df.reset_index(inplace = True, drop = True)
df.head(3)

Unnamed: 0,country
0,Albania
1,American Samoa
2,Andorra


In [30]:
"""Now join the tables onto the main df with merge, using a left join
so that the table on the left is the reference. The first four merges
match on country through left_on and right_on; the last three (medals
by gender and total athletes) match the left country column against
the right table's index with right_index = True.
Suffixes are added according to the table"""

df = (df.merge(aux_athletes, how = 'left', left_on = 'country', right_on = 'country', suffixes=('', '_athletes'))
      .merge(aux_officials, how = 'left', left_on = 'country', right_on = 'country', suffixes=('', '_officials'))
      .merge(aux_coaches, how = 'left', left_on = 'country', right_on = 'country', suffixes=('', '_coaches'))
      .merge(aux_medals, how = 'left', left_on = 'country', right_on = 'country', suffixes=('', '_medals'))
      .merge(aux_female_medals, how = 'left', left_on = 'country', right_index = True, suffixes=('', '_medals'))
      .merge(aux_male_medals, how = 'left', left_on = 'country', right_index = True, suffixes=('', '_medals'))
      .merge(aux_total_athletes, how = 'left', left_on = 'country', right_index = True, suffixes = ('', '_tathletes'))
      .fillna(0))
df.head(3)

Unnamed: 0,country,female_athletes,male_athletes,female_officials,male_officials,female_coaches,male_coaches,female_total_medals,male_total_medals,female_bronze_medals,female_gold_medals,female_silver_medals,male_bronze_medals,male_gold_medals,male_silver_medals,athletes
0,Albania,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,American Samoa,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,Andorra,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
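
A reduced sketch of one of these joins, on made-up data with hypothetical names: a left merge keeps every country of the base table, and countries missing from the right table get `NaN`, which `fillna(0)` then turns into zeros.

```python
import pandas as pd

# Base table with every country, and a count table indexed by country
base = pd.DataFrame({'country': ['A', 'B', 'C']})
counts = pd.DataFrame({'athletes': [5, 2]},
                      index=pd.Index(['A', 'C'], name='country'))

# Match the left 'country' column against the right table's index
merged = (base.merge(counts, how='left', left_on='country', right_index=True)
              .fillna(0))
print(merged)
```

The counts come out as floats (`5.0`, `0.0`, `2.0`) because the intermediate `NaN` forces a float column. This matches the decimals in the output above, and a final `.astype(int)` on the numeric columns would restore integers if needed.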


In [31]:
"""Put the columns we are going to use in a list"""
columns = ['Total number of athletes: athletes',
'Total number of male athletes sent: male_athletes',
'Total number of female athletes sent: female_athletes',
'Total number of male coaches sent: male_coaches',
'Total number of female coaches sent: female_coaches',
'Total number of male officials sent: male_officials',
'Total number of female officials sent: female_officials',
'Total number of medals won by men: male_total_medals',
'Total number of medals won by women: female_total_medals',
'Total number of gold medals won by men: male_gold_medals',
'Total number of gold medals won by women: female_gold_medals',
'Total number of silver medals won by men: male_silver_medals',
'Total number of silver medals won by women: female_silver_medals',
'Total number of bronze medals won by men: male_bronze_medals',
'Total number of bronze medals won by women: female_bronze_medals']

"""Keep only the part of each entry that comes after the colon, in the
same list"""
columns = ['country'] + [c.split(':')[1].strip() for c in columns]
columns[:3]

['country', 'athletes', 'male_athletes']

Note: split breaks a string into a list of substrings at the given separator.
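
A minimal sketch of the extraction on a single label (the label text is only illustrative): `split(':')` returns the pieces on either side of the colon, and `strip()` removes the space left in front of the column name.

```python
label = 'Total number of athletes: athletes'

parts = label.split(':')   # description and column name
column = parts[1].strip()  # drop the leading space

print(parts)
print(column)
```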

In [32]:
"""Reorder the columns according to columns"""
df[columns].head(3)

Unnamed: 0,country,athletes,male_athletes,female_athletes,male_coaches,female_coaches,male_officials,female_officials,male_total_medals,female_total_medals,male_gold_medals,female_gold_medals,male_silver_medals,female_silver_medals,male_bronze_medals,female_bronze_medals
0,Albania,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,American Samoa,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Andorra,5.0,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
"""Copy the table to the clipboard"""
df[columns].to_clipboard()

In [34]:
df[columns]

Unnamed: 0,country,athletes,male_athletes,female_athletes,male_coaches,female_coaches,male_officials,female_officials,male_total_medals,female_total_medals,male_gold_medals,female_gold_medals,male_silver_medals,female_silver_medals,male_bronze_medals,female_bronze_medals
0,Albania,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,American Samoa,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Andorra,5.0,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Argentina,6.0,2.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Armenia,6.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
87,Turkey,7.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
88,Ukraine,46.0,24.0,22.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
89,United States of America,225.0,117.0,108.0,3.0,1.0,0.0,3.0,20.0,43.0,5.0,6.0,9.0,31.0,6.0,6.0
90,Uzbekistan,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [35]:
"""To export"""
#df[columns].to_csv('Beijing_Oviedo_Quezada_Rolando.csv', index = False)

'To export'