## Exploratory analysis

We load the dataframe:

In [45]:
import pandas as pd

df = pd.read_csv(filepath_or_buffer='./books.csv')

# Target columns (David)
interesting_columns = [
    "books_count", "best_book_id", "reviews_count", "ratings_sum", "ratings_count",
    "text_reviews_count", "original_publication_date", "original_title", "media_type",
    "num_ratings_5", "num_ratings_4", "num_ratings_3", "num_ratings_2", "num_ratings_1"
]

df[interesting_columns].head()

Unnamed: 0,books_count,best_book_id,reviews_count,ratings_sum,ratings_count,text_reviews_count,original_publication_date,original_title,media_type,num_ratings_5,num_ratings_4,num_ratings_3,num_ratings_2,num_ratings_1
0,329,1,2484565,9397945,2061689,33051,2005/7/16,Harry Potter and the Half-Blood Prince,book,1370717,511319,148108,23215,8330
1,204,3,6587807,25971026,5817260,93396,1997/6/26,Harry Potter and the Philosopher's Stone,book,3732124,1357541,517618,117411,92566
2,148,5,2961357,10494960,2305410,45165,1999/7/8,Harry Potter and the Prisoner of Azkaban,book,1521804,573212,180231,22236,7927
3,6,10,37473,137459,29058,939,2005/1/1,"Harry Potter Collection (Harry Potter, #1-6)",book,23380,4206,1030,203,239
4,7,9,78,96,26,1,2005/4/1,"Unauthorized Harry Potter Book Seven News: ""Ha...",book,8,7,6,5,0


Let's check the data type of each column:

In [46]:
df[interesting_columns].dtypes

books_count                   int64
best_book_id                  int64
reviews_count                 int64
ratings_sum                   int64
ratings_count                 int64
text_reviews_count            int64
original_publication_date    object
original_title               object
media_type                   object
num_ratings_5                 int64
num_ratings_4                 int64
num_ratings_3                 int64
num_ratings_2                 int64
num_ratings_1                 int64
dtype: object

From the types and the sample rows, we can infer what each column contains:

| Column | Type | Notes |
| --- | --- | --- |
| books_count | int64 | ? |
| best_book_id | int64 | ? |
| reviews_count | int64 | Number of reviews |
| ratings_sum | int64 | Sum of all ratings |
| ratings_count | int64 | Number of ratings |
| text_reviews_count | int64 | Number of written reviews |
| original_publication_date | object | Original publication date |
| original_title | object | Original title |
| media_type | object | Media type |
| num_ratings_5 | int64 | Number of times rated 5 |
| num_ratings_4 | int64 | Number of times rated 4 |
| num_ratings_3 | int64 | Number of times rated 3 |
| num_ratings_2 | int64 | Number of times rated 2 |
| num_ratings_1 | int64 | Number of times rated 1 |
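
The sample rows suggest that `ratings_count` and `ratings_sum` are derived from the five `num_ratings_*` counters. The following is a minimal sketch of that check. It uses only three rows copied from the sample output (indices 0, 3 and 4), not the full file, and the names `sample`, `stars`, `derived_count` and `derived_sum` are illustrative.

```python
import pandas as pd

# Three rows copied from the sample output above (not the full dataset).
sample = pd.DataFrame({
    "ratings_sum": [9397945, 137459, 96],
    "ratings_count": [2061689, 29058, 26],
    "num_ratings_5": [1370717, 23380, 8],
    "num_ratings_4": [511319, 4206, 7],
    "num_ratings_3": [148108, 1030, 6],
    "num_ratings_2": [23215, 203, 5],
    "num_ratings_1": [8330, 239, 0],
})

stars = [5, 4, 3, 2, 1]
rating_columns = ["num_ratings_{}".format(s) for s in stars]

# Number of ratings: plain sum of the five per-star counters.
derived_count = sample[rating_columns].sum(axis=1)

# Sum of ratings: each counter weighted by its star value.
derived_sum = sum(s * sample["num_ratings_{}".format(s)] for s in stars)

count_matches = bool((derived_count == sample["ratings_count"]).all())
sum_matches = bool((derived_sum == sample["ratings_sum"]).all())
print(count_matches, sum_matches)
```

Both checks hold for these three rows. Whether they hold for every row of `books.csv` would have to be verified on the full dataframe.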



We drop the columns that add no value to our study:

In [47]:
droppable_columns = ["books_count", "best_book_id", "original_title", "original_publication_date"]
df.drop(columns=droppable_columns, inplace=True)

df.head()

Unnamed: 0,id,isbn,author,title,isbn13,asin,kindle_asin,marketplace_id,country_code,publication_date,...,average_rating,num_pages,format,edition_information,ratings_count_global,text_reviews_count_global,authors,to_read,read,currently_reading
0,1,0439785960,,Harry Potter and the Half-Blood Prince (Harry ...,9780440000000.0,,,,ES,2006/9/16,...,4.56,652.0,Paperback,,1919694,25791,"['J.K. Rowling', 'Mary GrandPré']",229495.0,7492,22335.0
1,3,0439554934,,Harry Potter and the Sorcerer's Stone (Harry P...,9780440000000.0,,,,ES,1997/6/26,...,4.46,320.0,Hardcover,,5536122,69174,"['J.K. Rowling', 'Mary GrandPré']",3286.0,15019,87801.0
2,5,043965548X,,Harry Potter and the Prisoner of Azkaban (Harr...,9780440000000.0,,,,ES,2004/5/1,...,4.55,435.0,Mass Market Paperback,,2120869,33324,"['J.K. Rowling', 'Mary GrandPré']",193024.0,10469,31945.0
3,7,0439887453,,"The Harry Potter Collection (Harry Potter, #1-6)",9780440000000.0,,,,ES,2006/9/1,...,4.73,,Paperback,Box Set,1597,115,"['J.K. Rowling', 'Mary GrandPré']",4521.0,24,189.0
4,9,0976540606,,"Unauthorized Harry Potter Book Seven News: ""Ha...",9780977000000.0,,,,ES,2005/4/26,...,3.69,152.0,Paperback,,18,1,['W. Frederick Zimmerman'],27.0,2,
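
Because the drop is done in place, running that cell a second time raises a `KeyError`: the columns no longer exist. One way to make the step repeatable is the `errors="ignore"` option of `DataFrame.drop`, sketched here on a toy dataframe (`toy` is a hypothetical stand-in for `df`).

```python
import pandas as pd

# Toy stand-in for df; it only contains two of the four droppable columns.
toy = pd.DataFrame({
    "books_count": [329, 204],
    "best_book_id": [1, 3],
    "ratings_count": [2061689, 5817260],
})
droppable_columns = ["books_count", "best_book_id", "original_title", "original_publication_date"]

# errors="ignore" skips labels that are not present, so the call can be repeated safely.
toy = toy.drop(columns=droppable_columns, errors="ignore")
toy = toy.drop(columns=droppable_columns, errors="ignore")

remaining = list(toy.columns)
print(remaining)
```

Reassigning the result instead of using `inplace=True` also keeps the original frame available if the cell has to be revisited.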


Let's check whether the selected columns contain nulls:

In [48]:
# Columns of interest that were not dropped in the previous step
selected_columns = [c for c in interesting_columns if c not in droppable_columns]

# Print the number of nulls in each remaining column
for column in selected_columns:
    print("{}: {}".format(column, df[column].isnull().sum()))

reviews_count: 0
ratings_sum: 0
ratings_count: 0
text_reviews_count: 0
media_type: 936
num_ratings_5: 0
num_ratings_4: 0
num_ratings_3: 0
num_ratings_2: 0
num_ratings_1: 0
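
The same summary is available without printing column by column: `DataFrame.isna().sum()` returns the null count per column, and `.mean()` returns the fraction of nulls. This is a small sketch on made-up data (the `toy` frame is hypothetical and not taken from `books.csv`).

```python
import pandas as pd

# Made-up data with two missing media types out of four rows.
toy = pd.DataFrame({
    "ratings_count": [10, 20, 30, 40],
    "media_type": ["book", None, "periodical", None],
})

null_counts = toy.isna().sum()   # number of nulls per column
null_share = toy.isna().mean()   # fraction of nulls per column
print(null_counts)
print(null_share)
```

The fraction is useful for judging how much of a column is affected before deciding how to handle its nulls.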


The *media_type* column has 936 nulls. Let's look at its unique values:

In [56]:
df["media_type"].unique()
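
`unique()` lists the distinct values but not how often each one occurs. As a possible complement, `Series.value_counts(dropna=False)` also reports the frequencies, including the nulls. This sketch uses a made-up series (`toy` is hypothetical and its values are not from the dataset).

```python
import pandas as pd

# Made-up media types, including two missing entries.
toy = pd.Series(["book", None, "book", "periodical", None, "book"], name="media_type")

# dropna=False keeps the missing values as their own category.
counts = toy.value_counts(dropna=False)
null_count = int(counts[counts.index.isna()].iloc[0])
print(counts)
```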

Let's check which items are labelled *not a book*:

In [50]:
df[["title", "media_type"]][df.media_type == "not a book"]

Unnamed: 0,title,media_type
1320,Mini-Manual of the Urban Guerrilla,not a book
1806,The Second World War: A Complete History,not a book


Now let's look at the ones labelled *periodical*:

In [51]:
df[["title", "media_type"]][df.media_type == "periodical"]

Unnamed: 0,title,media_type
2952,McSweeney's #24,periodical


Finally, let's look at the first rows whose *media_type* is null:

In [52]:
df[["title", "media_type"]][df.media_type.isna()].head()

Unnamed: 0,title,media_type
54,The Pobble Who Has No Toes (Edward Lear's Litt...,
56,Public Places-Urban Spaces: The Dimensions of ...,
57,Haunted Places in America: A Guide to Spooked ...,
58,Together Alone: Personal Relationships in Publ...,
59,Art for Public Places: Critical Essays,


We replace the *NaN* entries with *unknown*:

In [53]:
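# Replace missing media types with an explicit 'unknown' label
df.media_type = df.media_type.fillna(value='unknown')
df.media_type.isna().sum()  # not displayed: a cell only echoes its last expression
df.media_type.unique()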

array(['book', 'unknown', 'not a book', 'periodical'], dtype=object)