## Pandas Data Structures

Pandas provides 2 main data structures:
 - DataFrame
 - Series

#### DataFrame
A DataFrame is a 2-dimensional, table-like data structure with labeled axes (the index for the rows, the column names for the columns).

![Pandas DataFrame](https://media.geeksforgeeks.org/wp-content/uploads/finallpandas.png "Pandas DataFrame")


#### Series
A Series, on the other hand, is a 1-dimensional list of labeled values. In essence, it is a single column of a DataFrame. In the image, we can see 3 distinct Series: ```'Name'```, ```'Team'``` and ```'Number'```.

![Pandas Series](https://media.geeksforgeeks.org/wp-content/uploads/dataSER-1.png "Pandas Series")
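
As a minimal sketch of how the two structures relate, the following builds a small DataFrame from a dictionary and extracts one column as a Series. The column names mirror the image, but the values and the variable names (```players```, ```numbers```) are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration only.
players = pd.DataFrame({
    "Name": ["Ana", "Luis", "Marta"],
    "Team": ["Red", "Blue", "Red"],
    "Number": [7, 10, 23],
})

# Selecting a single column returns a Series that shares the DataFrame's index.
numbers = players["Number"]

print(type(players).__name__, players.shape)  # 2-dimensional: (rows, columns)
print(type(numbers).__name__, numbers.shape)  # 1-dimensional: (rows,)
```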

In [1]:
import pandas as pd

In [29]:
df = pd.read_csv('anime.csv')

#### DataFrame

In [30]:
df

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11067 | 34102 | Sakurada Reset | Mystery, School, Super Power, Supernatural | TV | Unknown | NaN | 2076 |
| 11068 | 33889 | Saredo Tsumibito wa Ryuu to Odoru | Action, Drama, Fantasy, Sci-Fi | TV | Unknown | NaN | 416 |
| 11069 | 34289 | Schoolgirl Strikers: Animation Channel | Action, School | TV | Unknown | NaN | 1465 |
| 11070 | 32032 | Seikaisuru Kado | NaN | NaN | Unknown | NaN | 1797 |


#### Series

In [31]:
df['name']

0                                Kimi no Na wa.
1              Fullmetal Alchemist: Brotherhood
2                                      Gintama°
3                                   Steins;Gate
4                                 Gintama&#039;
                          ...                  
11067                            Sakurada Reset
11068         Saredo Tsumibito wa Ryuu to Odoru
11069    Schoolgirl Strikers: Animation Channel
11070                           Seikaisuru Kado
11071                                    Seiren
Name: name, Length: 11072, dtype: object

## General DataFrame Information

```df.head()``` shows the first rows of the DataFrame and is useful for a quick look at the data. By default it shows the first 5 rows; passing an integer ```n``` shows the first ```n``` rows instead.

In [3]:
df.head()

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |
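
The effect of the argument to ```head()``` can be sketched on a small stand-in DataFrame. The ```sample``` data below is hypothetical (it is not the anime dataset), and ```tail()``` is included only as the counterpart that returns the last rows.

```python
import pandas as pd

# Hypothetical stand-in for the anime DataFrame: ten rows of invented values.
sample = pd.DataFrame({
    "anime_id": range(10),
    "rating": [9.0 - 0.1 * i for i in range(10)],
})

default_head = sample.head()   # first 5 rows by default
first_three = sample.head(3)   # first 3 rows
last_two = sample.tail(2)      # last 2 rows

print(first_three)
```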


```df.shape``` gives the shape of the DataFrame as the tuple (rows, columns).

In [4]:
df.shape

(11072, 7)

```df.dtypes``` shows the data type of each column of the DataFrame. Pandas reports string columns with the ```object``` dtype.

In [5]:
df.dtypes

anime_id      int64
name         object
genre        object
type         object
episodes     object
rating      float64
members       int64
dtype: object
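
In the output above, ```episodes``` is reported as ```object``` even though it mostly holds numbers, presumably because some rows contain the text ```Unknown``` (as seen in the DataFrame preview). One possible way to obtain a numeric column, sketched here on a small hypothetical sample, is ```pd.to_numeric``` with ```errors='coerce'```, which turns values that cannot be parsed into ```NaN```.

```python
import pandas as pd

# Hypothetical sample mimicking the 'episodes' column: numbers stored as text.
sample = pd.DataFrame({"episodes": ["1", "64", "Unknown"]})

# Values that cannot be parsed as numbers become NaN instead of raising an error.
episodes_num = pd.to_numeric(sample["episodes"], errors="coerce")

print(episodes_num)
```

Whether coercing ```Unknown``` to ```NaN``` is appropriate depends on the analysis; this is only one option.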

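## Accessing Data

```df.nombreColumna``` (attribute access, where ```nombreColumna``` stands for the column name) returns the Series (column) of interest. This form does not work for column names that contain a space: for example, ```df.nombre anime``` fails because it is not valid Python syntax. It is also unreliable when a column name matches an existing DataFrame attribute or method, such as ```shape```.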

In [6]:
df.rating

0        9.37
1        9.26
2        9.25
3        9.17
4        9.16
         ... 
11067     NaN
11068     NaN
11069     NaN
11070     NaN
11071     NaN
Name: rating, Length: 11072, dtype: float64

```df['nombreColumna']``` also returns the Series (column) of interest. This form has no problem with column names that contain spaces, and it can return more than one column if we pass a list, e.g. ```df[['nombreColumna1', 'nombreColumna2']]```, in which case the result is a DataFrame.

In [7]:
df['rating']

0        9.37
1        9.26
2        9.25
3        9.17
4        9.16
         ... 
11067     NaN
11068     NaN
11069     NaN
11070     NaN
11071     NaN
Name: rating, Length: 11072, dtype: float64

In [8]:
df[['name', 'rating']]

| | name | rating |
|---|---|---|
| 0 | Kimi no Na wa. | 9.37 |
| 1 | Fullmetal Alchemist: Brotherhood | 9.26 |
| 2 | Gintama° | 9.25 |
| 3 | Steins;Gate | 9.17 |
| 4 | Gintama&#039; | 9.16 |
| ... | ... | ... |
| 11067 | Sakurada Reset | NaN |
| 11068 | Saredo Tsumibito wa Ryuu to Odoru | NaN |
| 11069 | Schoolgirl Strikers: Animation Channel | NaN |
| 11070 | Seikaisuru Kado | NaN |
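
The difference between the two access styles can be sketched with a hypothetical DataFrame whose column name contains a space. The column ```nombre anime``` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical DataFrame with a column name that contains a space.
sample = pd.DataFrame({
    "nombre anime": ["Steins;Gate", "Gintama"],
    "rating": [9.17, 9.25],
})

by_brackets = sample["nombre anime"]          # works for any column name
by_attribute = sample.rating                  # needs an identifier-like name
subset = sample[["nombre anime", "rating"]]   # a list of names returns a DataFrame

print(type(by_brackets).__name__)  # Series
print(type(subset).__name__)       # DataFrame
```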


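## Basic Operations on Data

The basic arithmetic operators (+, -, *, /) can be applied to the numeric columns of a DataFrame, either with a constant or between columns. Operations between columns are applied element by element, and a ```NaN``` in either operand produces ```NaN``` in the result.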

In [9]:
df['rating'] + 10

0        19.37
1        19.26
2        19.25
3        19.17
4        19.16
         ...  
11067      NaN
11068      NaN
11069      NaN
11070      NaN
11071      NaN
Name: rating, Length: 11072, dtype: float64

In [10]:
df['rating'] + df['members']

0        200639.37
1        793674.26
2        114271.25
3        673581.17
4        151275.16
           ...    
11067          NaN
11068          NaN
11069          NaN
11070          NaN
11071          NaN
Length: 11072, dtype: float64
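
The results above are new Series; the DataFrame itself is not modified. As a small sketch on hypothetical data, the result of an operation can be stored by assigning it to a new column (the name ```rating_x10``` is an arbitrary choice for this example).

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing rating.
sample = pd.DataFrame({
    "rating": [9.37, np.nan, 7.5],
    "members": [200630, 2076, 100],
})

# Element-wise division between two columns; NaN propagates to the result.
ratio = sample["rating"] / sample["members"]

# Assigning to a new column name stores the result in the DataFrame.
sample["rating_x10"] = sample["rating"] * 10

print(sample)
```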

In addition, Pandas offers several functions for obtaining more information about the data.

```sum()```: Sums the values of the column\
```mean()```: Computes the mean of the column\
```describe()```: Returns summary statistics for the column (mean, standard deviation, etc.)\
```value_counts()```: Counts how many times each value appears in the column

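**Note:** Some of these operations can be run either on a specific column or on the whole DataFrame. ```describe()``` summarizes only the numeric columns by default. For ```sum()``` and ```mean()``` the behavior depends on the Pandas version: as the ```df.sum()``` output below shows, ```sum()``` may also concatenate text columns, and recent versions may raise an error when ```mean()``` meets text columns unless ```numeric_only=True``` is passed.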
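
The restriction to numeric columns can be requested explicitly with the ```numeric_only=True``` argument. The sketch below uses a small hypothetical DataFrame rather than the anime data.

```python
import pandas as pd

# Hypothetical sample mixing a text column with numeric columns.
sample = pd.DataFrame({
    "name": ["A", "B", "C"],
    "rating": [9.0, 8.0, 7.0],
    "members": [100, 200, 300],
})

# Only 'rating' and 'members' take part; 'name' is skipped.
means = sample.mean(numeric_only=True)
totals = sample.sum(numeric_only=True)

print(means)
print(totals)
```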

```sum()```

In [11]:
df.sum()

anime_id                                            159716780
name        Kimi no Na wa.Fullmetal Alchemist: Brotherhood...
episodes    1645124511014811011324120125251221012575426121...
rating                                                70859.1
members                                             218573467
dtype: object

In [12]:
df['rating'].sum()

70859.13

```mean()```

In [13]:
df.mean()

anime_id    14425.287211
rating          6.504418
members     19741.100704
dtype: float64

In [14]:
df['rating'].mean()

6.50441802827244

```describe()```

In [15]:
df.describe()

| | anime_id | rating | members |
|---|---|---|---|
| count | 11072.0 | 10894.0 | 11072.0 |
| mean | 14425.287211 | 6.504418 | 19741.1 |
| std | 11516.934152 | 1.050491 | 57506.97 |
| min | 1.0 | 1.67 | 5.0 |
| 25% | 3504.5 | 5.9 | 188.0 |
| 50% | 10747.0 | 6.62 | 1429.0 |
| 75% | 25394.0 | 7.24 | 11720.25 |
| max | 34527.0 | 10.0 | 1013917.0 |


In [16]:
df['rating'].describe()

count    10894.000000
mean         6.504418
std          1.050491
min          1.670000
25%          5.900000
50%          6.620000
75%          7.240000
max         10.000000
Name: rating, dtype: float64

```value_counts()```

In [17]:
df['type'].value_counts()

TV         3764
Movie      2331
OVA        2164
Special    1649
ONA         654
Music       488
Name: type, dtype: int64
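
By default, ```value_counts()``` leaves out missing values and returns absolute counts. As a sketch on a hypothetical Series, the ```dropna``` and ```normalize``` parameters change both behaviors.

```python
import pandas as pd

# Hypothetical 'type' values, including one missing entry.
types = pd.Series(["TV", "Movie", "TV", None, "OVA", "TV"])

counts = types.value_counts()                     # missing values are excluded
with_missing = types.value_counts(dropna=False)   # missing values get their own row
shares = types.value_counts(normalize=True)       # relative frequencies

print(counts)
print(shares)
```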

## Null Values

As a first step, we can locate the null values in our DataFrame with ```isnull()```. It returns ```True``` or ```False``` for each cell, depending on whether the value is null (```NaN```) or not.

In [18]:
df.isnull()

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11067 | False | False | False | False | False | True | False |
| 11068 | False | False | False | False | False | True | False |
| 11069 | False | False | False | False | False | True | False |
| 11070 | False | False | True | True | False | True | False |


This representation is not very useful on its own, so we can apply the ```sum()``` function seen earlier to count the null values in each column (each ```True``` counts as 1).

In [19]:
df.isnull().sum()

anime_id      0
name          0
genre        58
type         22
episodes      0
rating      178
members       0
dtype: int64

To get the total for the whole DataFrame, we can sum once more:

In [20]:
df.isnull().sum().sum()

258
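
The same counting pattern can be sketched on a small hypothetical DataFrame. Taking ```mean()``` of the boolean mask instead of ```sum()``` gives the fraction of null values per column, which is an addition to what is shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values in both columns.
sample = pd.DataFrame({
    "genre": ["Drama", None, "Action", None],
    "rating": [9.37, np.nan, 8.0, 7.5],
})

nulls_per_column = sample.isnull().sum()   # each True counts as 1
total_nulls = nulls_per_column.sum()       # total over the whole DataFrame
share_null = sample.isnull().mean()        # fraction of nulls per column

print(nulls_per_column)
print(total_nulls)
```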

There are 2 approaches for dealing with null values:
 - Replace
 - Remove

#### Replace
The simplest way to replace null values is the Pandas function ```fillna()```, which substitutes every ```NaN``` with the value we choose.

In [21]:
df.fillna(0)

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11067 | 34102 | Sakurada Reset | Mystery, School, Super Power, Supernatural | TV | Unknown | 0.00 | 2076 |
| 11068 | 33889 | Saredo Tsumibito wa Ryuu to Odoru | Action, Drama, Fantasy, Sci-Fi | TV | Unknown | 0.00 | 416 |
| 11069 | 34289 | Schoolgirl Strikers: Animation Channel | Action, School | TV | Unknown | 0.00 | 1465 |
| 11070 | 32032 | Seikaisuru Kado | 0 | 0 | Unknown | 0.00 | 1797 |
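
As the last row of the output above shows, ```fillna(0)``` also writes ```0``` into text columns such as ```genre``` and ```type```. One possible alternative, sketched on hypothetical data, is to pass a dictionary with a different fill value per column. The placeholder text ```"Unknown"``` and the use of the column mean are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing genre and one missing rating.
sample = pd.DataFrame({
    "genre": ["Drama", None],
    "rating": [9.0, np.nan],
})

# Each key is a column name; each value is the replacement for NaN in that column.
filled = sample.fillna({
    "genre": "Unknown",
    "rating": sample["rating"].mean(),
})

print(filled)
```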


#### Remove
Alternatively, rows that contain null values can be removed with ```dropna()```.

In [22]:
df.dropna()

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10891 | 11095 | Zouressha ga Yatte Kita | Adventure | Movie | 1 | 6.06 | 78 |
| 10892 | 7808 | Zukkoke Knight: Don De La Mancha | Adventure, Comedy, Historical, Romance | TV | 23 | 6.47 | 172 |
| 10893 | 28543 | Zukkoke Sannin-gumi no Hi Asobi Boushi Daisakusen | Drama, Kids | OVA | 1 | 5.83 | 50 |
| 10894 | 18967 | Zukkoke Sannin-gumi: Zukkoke Jikuu Bouken | Comedy, Historical, Sci-Fi | OVA | 1 | 6.13 | 76 |


```dropna()``` returns a **new** DataFrame and does not modify the original, so ```df``` still contains ```NaN``` values at this point.

To keep the result, we can assign it back, as in ```df = df.dropna()```.
In this example we assign it to a different variable instead.

In [23]:
df_limpio = df.dropna()
df_limpio

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10891 | 11095 | Zouressha ga Yatte Kita | Adventure | Movie | 1 | 6.06 | 78 |
| 10892 | 7808 | Zukkoke Knight: Don De La Mancha | Adventure, Comedy, Historical, Romance | TV | 23 | 6.47 | 172 |
| 10893 | 28543 | Zukkoke Sannin-gumi no Hi Asobi Boushi Daisakusen | Drama, Kids | OVA | 1 | 5.83 | 50 |
| 10894 | 18967 | Zukkoke Sannin-gumi: Zukkoke Jikuu Bouken | Comedy, Historical, Sci-Fi | OVA | 1 | 6.13 | 76 |


In [24]:
df

| | anime_id | name | genre | type | episodes | rating | members |
|---|---|---|---|---|---|---|---|
| 0 | 32281 | Kimi no Na wa. | Drama, Romance, School, Supernatural | Movie | 1 | 9.37 | 200630 |
| 1 | 5114 | Fullmetal Alchemist: Brotherhood | Action, Adventure, Drama, Fantasy, Magic, Mili... | TV | 64 | 9.26 | 793665 |
| 2 | 28977 | Gintama° | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.25 | 114262 |
| 3 | 9253 | Steins;Gate | Sci-Fi, Thriller | TV | 24 | 9.17 | 673572 |
| 4 | 9969 | Gintama&#039; | Action, Comedy, Historical, Parody, Samurai, S... | TV | 51 | 9.16 | 151266 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11067 | 34102 | Sakurada Reset | Mystery, School, Super Power, Supernatural | TV | Unknown | NaN | 2076 |
| 11068 | 33889 | Saredo Tsumibito wa Ryuu to Odoru | Action, Drama, Fantasy, Sci-Fi | TV | Unknown | NaN | 416 |
| 11069 | 34289 | Schoolgirl Strikers: Animation Channel | Action, School | TV | Unknown | NaN | 1465 |
| 11070 | 32032 | Seikaisuru Kado | NaN | NaN | Unknown | NaN | 1797 |


Here we can see the difference between the two DataFrames: ```df_limpio``` has no rows with ```NaN```, while ```df``` still does.
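
By default, ```dropna()``` removes a row if any of its columns is null. As a sketch on hypothetical data, the ```subset``` parameter limits the check to chosen columns, and the original DataFrame keeps all of its rows in both cases.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: row 1 lacks a genre, row 2 lacks a rating.
sample = pd.DataFrame({
    "genre": ["Drama", None, "Action"],
    "rating": [9.0, 8.0, np.nan],
})

all_clean = sample.dropna()                       # drops rows with any NaN
rating_clean = sample.dropna(subset=["rating"])   # only checks 'rating'

print(len(sample), len(all_clean), len(rating_clean))  # 3 1 2
```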

## Grouping

Finally, to group data we can use ```groupby()```.

In [25]:
# Group the anime by type (movie, TV series, etc.)
grupos = df.groupby('type')

To understand what ```groupby``` is doing, we can print the groups it creates.

In [26]:
grupos = df.groupby('type')
for name, group in grupos:
    print(name)
    print(group)
    print()

Movie
       anime_id                                               name  \
0         32281                                     Kimi no Na wa.   
8         15335  Gintama Movie: Kanketsu-hen - Yorozuya yo Eien...   
11        28851                                     Koe no Katachi   
15          199                      Sen to Chihiro no Kamikakushi   
18        12355                       Ookami Kodomo no Ame to Yuki   
...         ...                                                ...   
11035     32900  Mahouka Koukou no Rettousei Movie: Hoshi wo Yo...   
11040     20715                                               Mint   
11054     34161                                     Overlord Movie   
11057     30695                                           Pop in Q   
11059     33391                  Red Ash The Animation: Magicicada   

                                                   genre   type episodes  \
0                   Drama, Romance, School, Supernatural  Movie        1   
8

However, the real value of ```groupby``` comes from computing statistics over the groups it generates.

In [27]:
# Show the mean rating of each group
grupos['rating'].mean()

type
Movie      6.320852
Music      5.588996
ONA        5.643436
OVA        6.470460
Special    6.527570
TV         6.902299
Name: rating, dtype: float64

Finally, to apply several aggregations to the groups at once, we can use ```.agg```.

In [28]:
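# Function names are passed as strings; recent Pandas versions warn about the builtins min and max
grupos['rating'].agg(['min', 'max', 'mean'])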

| type | min | max | mean |
|---|---|---|---|
| Movie | 2.49 | 10.0 | 6.320852 |
| Music | 3.28 | 8.38 | 5.588996 |
| ONA | 2.58 | 8.26 | 5.643436 |
| OVA | 2.0 | 9.25 | 6.47046 |
| Special | 1.67 | 8.66 | 6.52757 |
| TV | 2.67 | 9.6 | 6.902299 |
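
When different columns need different aggregations, ```.agg``` also accepts keyword arguments of the form ```new_name=(column, function)```. The sketch below uses a small hypothetical DataFrame; the output column names ```mean_rating``` and ```total_members``` are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical sample with two types and invented values.
sample = pd.DataFrame({
    "type": ["TV", "TV", "Movie", "Movie"],
    "rating": [8.0, 6.0, 9.0, 7.0],
    "members": [100, 300, 50, 150],
})

# Each keyword names an output column: (source column, aggregation function).
summary = sample.groupby("type").agg(
    mean_rating=("rating", "mean"),
    total_members=("members", "sum"),
)

print(summary)
```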
