### Imports and files

#### Anime.csv

* anime_id - myanimelist.net's unique ID identifying an anime.
* name - full name of the anime.
* genre - comma-separated list of genres for this anime.
* type - Movie, TV, OVA, etc.
* episodes - number of episodes in this show (1 for a movie).
* rating - average rating out of 10 for this anime.
* members - number of community members in this anime's "group".


In [5]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import matplotlib.pyplot as plt
import pandas as pd

spark = SparkSession.builder.appName('AnimeAnalysis').getOrCreate()

spark.conf.set('spark.sql.repl.eagerEval.enabled', True)

In [6]:
anime = spark.read.csv('../data/raw/anime.csv', header=True, inferSchema=True)

### Anime dataset

In [7]:
# Show the DataFrame's structure
anime.printSchema()

# Show the first 5 records
anime.show(5)

# Count the number of records in the DataFrame
count = anime.count()
print('Number of records in the ANIME DataFrame:', count)

# Shape as (rows, columns), reusing the count computed above
print((count, len(anime.columns)))



root
 |-- anime_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- type: string (nullable = true)
 |-- episodes: string (nullable = true)
 |-- rating: double (nullable = true)
 |-- members: integer (nullable = true)

+--------+--------------------+--------------------+-----+--------+------+-------+
|anime_id|                name|               genre| type|episodes|rating|members|
+--------+--------------------+--------------------+-----+--------+------+-------+
|   32281|      Kimi no Na wa.|Drama, Romance, S...|Movie|       1|  9.37| 200630|
|    5114|Fullmetal Alchemi...|Action, Adventure...|   TV|      64|  9.26| 793665|
|   28977|            Gintama°|Action, Comedy, H...|   TV|      51|  9.25| 114262|
|    9253|         Steins;Gate|    Sci-Fi, Thriller|   TV|      24|  9.17| 673572|
|    9969|       Gintama&#039;|Action, Comedy, H...|   TV|      51|  9.16| 151266|
+--------+--------------------+--------------------+-----+--------+------+-------+
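
The fifth row of the output above shows an HTML entity (`&#039;`) inside the `name` column. A minimal sketch of decoding such entities with the standard library's `html.unescape` follows. The helper name `clean_name` is hypothetical and is not part of this notebook.

```python
import html


def clean_name(raw):
    """Decode HTML entities such as &#039; in a title; pass None through."""
    if raw is None:
        return None
    return html.unescape(raw)


# The entity &#039; is the apostrophe character
print(clean_name("Gintama&#039;"))
```

One possible way to apply this to the whole column would be to wrap the helper in a Spark UDF (`F.udf(clean_name, "string")`) and use it with `withColumn`. That step is an assumption and is not run here.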

### Statistics

In [8]:
anime.select("episodes", "rating", "members").describe().show()

+-------+------------------+-----------------+-----------------+
|summary|          episodes|           rating|          members|
+-------+------------------+-----------------+-----------------+
|  count|             12294|            12064|            12294|
|   mean|12.382549774134182|6.473901690981445|18071.33886448674|
| stddev| 46.86535196440979|1.026746306898068|54820.67692490701|
|    min|                 1|             1.67|                5|
|    max|           Unknown|             10.0|          1013917|
+-------+------------------+-----------------+-----------------+



In [9]:
null_counts = anime.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in anime.columns])
null_counts.show()

+--------+----+-----+----+--------+------+-------+
|anime_id|name|genre|type|episodes|rating|members|
+--------+----+-----+----+--------+------+-------+
|       0|   0|   62|  25|       0|   230|      0|
+--------+----+-----+----+--------+------+-------+



In [10]:
anime.groupBy('type').count().orderBy('count', ascending=False).show()

+-------+-----+
|   type|count|
+-------+-----+
|     TV| 3787|
|    OVA| 3311|
|  Movie| 2348|
|Special| 1676|
|    ONA|  659|
|  Music|  488|
|   null|   25|
+-------+-----+



### Stop the Spark session

In [None]:
# Stop the SparkSession
spark.stop()