# Challenge - Introduction to Big Data

## Exercise 1: Ingesting semi-structured data

In [1]:
import json
import requests

In [2]:
response = requests.get('https://www.balldontlie.io/api/v1/games?per_page=100')
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error body
data = json.loads(response.text)

### Metadata

In [3]:
data['meta']

{'total_pages': 488,
 'current_page': 1,
 'next_page': 2,
 'per_page': 100,
 'total_count': 48755}
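
The metadata shows that this request returned only page 1 of 488. As a rough sketch of how the remaining pages could be walked, the generator below follows `meta['next_page']`. The names `iter_games` and `fetch_page_http` are hypothetical, and two things are assumptions rather than facts from the response above: that the endpoint accepts a `page` query parameter, and that the last page reports `next_page` as `None`.

```python
from typing import Callable, Iterator


def iter_games(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every game by following meta['next_page'] until it is falsy."""
    page = 1
    while page:
        payload = fetch_page(page)
        yield from payload['data']
        # Assumption: the last page reports next_page as None.
        page = payload['meta'].get('next_page')


def fetch_page_http(page: int) -> dict:
    """Hypothetical fetcher for the endpoint used above (needs network access)."""
    import requests

    response = requests.get(
        'https://www.balldontlie.io/api/v1/games',
        # Assumption: the API selects a page through a `page` parameter.
        params={'per_page': 100, 'page': page},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```

Passing the fetcher in as an argument keeps the paging logic testable without network access. Calling `list(iter_games(fetch_page_http))` would issue one request per page, so it is worth trying it on a few pages first.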

### Is it semi-structured?

A visual inspection suggests that the data is semi-structured. I verify this below:

In [4]:
data['data']

[{'id': 47179,
  'date': '2019-01-30T00:00:00.000Z',
  'home_team': {'id': 2,
   'abbreviation': 'BOS',
   'city': 'Boston',
   'conference': 'East',
   'division': 'Atlantic',
   'full_name': 'Boston Celtics',
   'name': 'Celtics'},
  'home_team_score': 126,
  'period': 4,
  'postseason': False,
  'season': 2018,
  'status': 'Final',
  'time': ' ',
  'visitor_team': {'id': 4,
   'abbreviation': 'CHA',
   'city': 'Charlotte',
   'conference': 'East',
   'division': 'Southeast',
   'full_name': 'Charlotte Hornets',
   'name': 'Hornets'},
  'visitor_team_score': 94},
 {'id': 48751,
  'date': '2019-02-09T00:00:00.000Z',
  'home_team': {'id': 2,
   'abbreviation': 'BOS',
   'city': 'Boston',
   'conference': 'East',
   'division': 'Atlantic',
   'full_name': 'Boston Celtics',
   'name': 'Celtics'},
  'home_team_score': 112,
  'period': 4,
  'postseason': False,
  'season': 2018,
  'status': 'Final',
  'time': '     ',
  'visitor_team': {'id': 13,
   'abbreviation': 'LAC',
   'city': 'LA',


As we can see, the records share the following fields:

In [5]:
print(', '.join(data['data'][0].keys()))

id, date, home_team, home_team_score, period, postseason, season, status, time, visitor_team, visitor_team_score


We verify this:

In [6]:
keys = set(data['data'][0].keys())

def has_all_keys(dic, keys):
    # A record matches only if it has exactly the reference keys:
    # none missing and none extra.
    return set(dic.keys()) == set(keys)

result = [item for item in data['data'] if not has_all_keys(item, keys)]

if len(result) == 0:
    print("All records match the structure")
else:
    print(f"There are {len(result)} records that do not match the structure")

All records match the structure


### What are the keys of each record?

Each record is keyed by its `id` attribute. Every game has its own `id`, and the nested objects (`home_team` and `visitor_team`) also use an `id` field as their identifier.
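
If `id` really acts as the key of a game, it should not repeat within the response. The helper below is a small sketch of that check; `duplicated_ids` is a hypothetical name, and the toy records only copy the ids visible in the output above.

```python
from collections import Counter


def duplicated_ids(records):
    """Return the ids that appear more than once in a list of records."""
    counts = Counter(record['id'] for record in records)
    return [record_id for record_id, n in counts.items() if n > 1]


# Toy records shaped like the API response (only the id fields are kept).
sample = [
    {'id': 47179, 'home_team': {'id': 2}, 'visitor_team': {'id': 4}},
    {'id': 48751, 'home_team': {'id': 2}, 'visitor_team': {'id': 13}},
]
print(duplicated_ids(sample))  # → []
```

Note that the team `id` 2 appears in both games. That is expected: the nested `id` identifies the team, not the game.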

## Exercise 2: Organizing the data

In [7]:
import numpy as np
import pandas as pd

In [8]:
# pd.json_normalize is the top-level entry point since pandas 1.0;
# older releases import it from pandas.io.json instead.
df = pd.json_normalize(data['data'])
df.head()

| | date | home_team.abbreviation | home_team.city | home_team.conference | home_team.division | home_team.full_name | home_team.id | home_team.name | home_team_score | id | ... | status | time | visitor_team.abbreviation | visitor_team.city | visitor_team.conference | visitor_team.division | visitor_team.full_name | visitor_team.id | visitor_team.name | visitor_team_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2019-01-30T00:00:00.000Z | BOS | Boston | East | Atlantic | Boston Celtics | 2 | Celtics | 126 | 47179 | ... | Final | | CHA | Charlotte | East | Southeast | Charlotte Hornets | 4 | Hornets | 94 |
| 1 | 2019-02-09T00:00:00.000Z | BOS | Boston | East | Atlantic | Boston Celtics | 2 | Celtics | 112 | 48751 | ... | Final | | LAC | LA | West | Pacific | LA Clippers | 13 | Clippers | 123 |
| 2 | 2019-02-08T00:00:00.000Z | PHI | Philadelphia | East | Atlantic | Philadelphia 76ers | 23 | 76ers | 117 | 48739 | ... | Final | | DEN | Denver | West | Northwest | Denver Nuggets | 8 | Nuggets | 110 |
| 3 | 2019-02-08T00:00:00.000Z | WAS | Washington | East | Southeast | Washington Wizards | 30 | Wizards | 119 | 48740 | ... | Final | | CLE | Cleveland | East | Central | Cleveland Cavaliers | 6 | Cavaliers | 106 |
| 4 | 2019-02-08T00:00:00.000Z | SAC | Sacramento | West | Pacific | Sacramento Kings | 26 | Kings | 102 | 48746 | ... | Final | | MIA | Miami | East | Southeast | Miami Heat | 16 | Heat | 96 |
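
The dotted column names in the table come from flattening the nested `home_team` and `visitor_team` objects. A minimal sketch on a single trimmed-down record (only a few of the real fields are kept) shows the effect:

```python
import pandas as pd

# One record with the same nesting as the API response, trimmed for brevity.
records = [
    {
        'id': 47179,
        'home_team': {'id': 2, 'full_name': 'Boston Celtics'},
        'home_team_score': 126,
        'visitor_team': {'id': 4, 'full_name': 'Charlotte Hornets'},
        'visitor_team_score': 94,
    },
]

# Nested dictionaries become columns named "<parent>.<child>".
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```

Each nested key becomes its own column, which is why the game `id` and the two team ids end up as three separate columns: `id`, `home_team.id` and `visitor_team.id`.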


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
date                         100 non-null object
home_team.abbreviation       100 non-null object
home_team.city               100 non-null object
home_team.conference         100 non-null object
home_team.division           100 non-null object
home_team.full_name          100 non-null object
home_team.id                 100 non-null int64
home_team.name               100 non-null object
home_team_score              100 non-null int64
id                           100 non-null int64
period                       100 non-null int64
postseason                   100 non-null bool
season                       100 non-null int64
status                       100 non-null object
time                         100 non-null object
visitor_team.abbreviation    100 non-null object
visitor_team.city            100 non-null object
visitor_team.conference      100 non-null object
visitor_team.division

## Exercise 3: The effect of playing at home

Create a column in the `pd.DataFrame` that flags whether the home team won (1) or not (0).

In [10]:
df['home_team_won'] = np.where(df['home_team_score'] > df['visitor_team_score'], 1, 0)

Repeat the procedure to flag whether the visiting team won (1) or not (0).

In [11]:
df['visitor_team_won'] = np.where(df['home_team_score'] < df['visitor_team_score'], 1, 0)
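
As a quick sanity check on the two indicator columns, the sketch below rebuilds them on three score pairs taken from the rows shown earlier, this time by casting the boolean comparison to `int` (equivalent to the `np.where` calls), and confirms that each game has exactly one winner. The `games` frame is a toy stand-in for `df`.

```python
import pandas as pd

# Scores of the first three games shown in df.head().
games = pd.DataFrame({
    'home_team_score': [126, 112, 117],
    'visitor_team_score': [94, 123, 110],
})

# Casting a boolean Series to int gives the same 1/0 flags as np.where.
games['home_team_won'] = (games['home_team_score'] > games['visitor_team_score']).astype(int)
games['visitor_team_won'] = (games['home_team_score'] < games['visitor_team_score']).astype(int)

# Every finished game should have exactly one winner.
print((games['home_team_won'] + games['visitor_team_won']).tolist())  # → [1, 1, 1]
```

A row with equal scores would get 0 in both columns, so a sum other than 1 is a cheap way to spot rows that may need a closer look.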

Report the top and bottom 5 teams by home performance.

In [12]:
cols = ['home_team.full_name', 'home_team_won']
home_report = df[cols].groupby('home_team.full_name').sum().sort_values(by='home_team_won', ascending=False)

**The top 5** by home performance:

In [13]:
home_report.head(5)

| home_team.full_name | home_team_won |
| --- | --- |
| Washington Wizards | 4 |
| Detroit Pistons | 4 |
| Indiana Pacers | 4 |
| Orlando Magic | 4 |
| Philadelphia 76ers | 4 |


**The bottom 5** by home performance:

In [14]:
home_report.tail(5)

| home_team.full_name | home_team_won |
| --- | --- |
| Miami Heat | 1 |
| Oklahoma City Thunder | 0 |
| Phoenix Suns | 0 |
| New York Knicks | 0 |
| Charlotte Hornets | 0 |


Report the top and bottom 5 teams by away performance.

In [15]:
cols = ['visitor_team.full_name', 'visitor_team_won']
visitor_report = df[cols].groupby('visitor_team.full_name').sum().sort_values(by='visitor_team_won', ascending=False)

**The top 5** by away performance:

In [16]:
visitor_report.head(5)

| visitor_team.full_name | visitor_team_won |
| --- | --- |
| Orlando Magic | 4 |
| LA Clippers | 3 |
| Miami Heat | 2 |
| Indiana Pacers | 2 |
| Portland Trail Blazers | 2 |


**The bottom 5** by away performance:

In [17]:
visitor_report.tail(5)

| visitor_team.full_name | visitor_team_won |
| --- | --- |
| New York Knicks | 0 |
| Minnesota Timberwolves | 0 |
| Dallas Mavericks | 0 |
| Memphis Grizzlies | 0 |
| Atlanta Hawks | 0 |
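
These rankings count wins, so they depend on how many home or away games each team happens to have in the 100-game sample. A possible refinement, sketched here on made-up data with hypothetical team names `A` and `B`, is to report games played and win rate next to the raw count:

```python
import pandas as pd

# Made-up results: both teams have 2 home wins, but B played fewer home games.
games = pd.DataFrame({
    'home_team.full_name': ['A', 'A', 'A', 'A', 'B', 'B'],
    'home_team_won': [1, 1, 0, 0, 1, 1],
})

summary = (
    games.groupby('home_team.full_name')['home_team_won']
    # Named aggregation: wins, number of home games, and share of games won.
    .agg(wins='sum', games='count', win_rate='mean')
    .sort_values(by=['win_rate', 'games'], ascending=False)
)
print(summary)
```

Sorting by `win_rate` places `B` ahead of `A` even though both have the same number of wins, which the sum alone cannot show.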


## Exercise 4: Computing the percentage of home and away wins

Create a new object that stores the percentage of games won at home by each team.

In [18]:
def victories_percent(group):
    total_victories = group['home_team_won'].sum()
    matches = group.shape[0]
    return 100 * total_victories / matches

winners_result = { full_name: victories_percent(group) for full_name, group in df.groupby('home_team.full_name') }

Repeat the same for the games in which the team was the visitor.

In [19]:
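def lost_percent(group):
    # Despite its name, this returns the percentage of games the visiting team won:
    # visitor_team_won is 1 when the visitor wins.
    total_lost = group['visitor_team_won'].sum()
    matches = group.shape[0]
    return 100 * total_lost / matches

# Keyed by team: percentage of away games won (not lost).
losers_result = { full_name: lost_percent(group) for full_name, group in df.groupby('visitor_team.full_name')}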

Which teams have the same chance of winning as visitor and at home?

In [20]:
percent_data = {
    'team': list(winners_result.keys()),
    'victories_percent': list(winners_result.values()),
    'lost_percent': [losers_result[key] for key in winners_result.keys()],
}

In [21]:
percent_df = pd.DataFrame(percent_data)

The teams with the same chance of winning as visitor and at home are:

In [22]:
percent_df[percent_df['victories_percent'] == percent_df['lost_percent']]

| | team | victories_percent | lost_percent |
| --- | --- | --- | --- |
| 12 | Los Angeles Lakers | 50.0 | 50.0 |
| 17 | New Orleans Pelicans | 33.333333 | 33.333333 |
| 18 | New York Knicks | 0.0 | 0.0 |
| 22 | Phoenix Suns | 0.0 | 0.0 |
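
The same comparison can also be written without the intermediate dictionaries. The sketch below is one possible alternative, not the approach used above: it takes the group mean of each indicator column, aligns both results by team name, and compares them with `np.isclose` to avoid exact floating-point equality. The data and the column names `home_win_pct` and `away_win_pct` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up games between three hypothetical teams A, B and C.
games = pd.DataFrame({
    'home_team.full_name': ['A', 'A', 'B', 'C'],
    'visitor_team.full_name': ['B', 'C', 'A', 'A'],
    'home_team_won': [1, 0, 1, 0],
    'visitor_team_won': [0, 1, 0, 1],
})

# The mean of a 1/0 column is the share of games won.
home_pct = 100 * games.groupby('home_team.full_name')['home_team_won'].mean()
away_pct = 100 * games.groupby('visitor_team.full_name')['visitor_team_won'].mean()

# concat aligns on team name; a team missing on one side gets NaN there.
rates = pd.concat(
    [home_pct.rename('home_win_pct'), away_pct.rename('away_win_pct')],
    axis=1,
)
same = rates[np.isclose(rates['home_win_pct'], rates['away_win_pct'])]
print(same.index.tolist())  # → ['A']
```

Because `np.isclose` treats `NaN` as different from everything by default, a team that appears only as host or only as visitor is left out instead of raising a `KeyError`.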
