The dataset contains movie ratings from the MovieLens project: https://movielens.org/
More information about the data is available at: http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('DataFrame_add').master('local[*]').getOrCreate()

            

In [2]:
data_path = 'ml-latest-small/'

# Load the DataFrames directly from the CSV files (the first row of each file is the header)
movies = spark.read.option("header","true").csv(data_path+'movies.csv')
ratings = spark.read.option("header","true").csv(data_path+'ratings.csv')

1. Which columns do the datasets loaded above contain?
2. How many movies does the provided dataset describe?
3. How many users have ratings in the dataset?
4. Does the dataset contain missing values?
5. How many movies have no ratings? Which movies are they?
6. Which movie has the best average rating? If there are several, give the one with the most votes.
7. What percentage of movies have only maximum ratings?
8. Which movie has the highest minimum rating? If there are several, give the one with the most votes.
9. What is the distribution of ratings?
10. How many movies are classified as documentaries ('Documentary')?
11. Which documentary with at least 10 votes has the highest average rating?
12. How does the number of ratings in the dataset change from year to year? Assume that timestamp represents the number of seconds since 1960.

### 1. Which columns do the datasets loaded above contain?

In [3]:
movies.printSchema()

root
 |-- movieId: string (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)



In [4]:
ratings.printSchema()

root
 |-- userId: string (nullable = true)
 |-- movieId: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- timestamp: string (nullable = true)



### 2. How many movies does the provided dataset describe?

In [5]:
movies.describe().show()

+-------+------------------+--------------------+------------------+
|summary|           movieId|               title|            genres|
+-------+------------------+--------------------+------------------+
|  count|              9742|                9742|              9742|
|   mean|42200.353623485935|                null|              null|
| stddev| 52160.49485443825|                null|              null|
|    min|                 1|"11'09""01 - Sept...|(no genres listed)|
|    max|             99992|À nous la liberté...|           Western|
+-------+------------------+--------------------+------------------+
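
Because every column was read as a string, `describe()` computes `min` and `max` by comparing text, so the `movieId` maximum of 99992 above is most likely a lexicographic maximum rather than the largest numeric identifier. A minimal plain-Python sketch of the effect, using made-up identifiers rather than rows from the dataset:

```python
# Hypothetical identifiers stored as text, as they are after reading a CSV without a schema
ids_as_text = ["1", "99992", "100044", "193609"]

# String comparison is lexicographic: "99992" sorts after "193609" because "9" > "1"
max_as_text = max(ids_as_text)

# Converting to int first gives the numeric maximum
max_as_int = max(int(x) for x in ids_as_text)

print(max_as_text)  # 99992
print(max_as_int)   # 193609
```

In Spark, the usual remedy is to cast the column before aggregating (for example `movies.movieId.cast('int')`) or to set the `inferSchema` option when reading the CSV.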



### 3. How many users have ratings in the dataset?

In [6]:
ratings3=ratings.groupBy("userId").count()

In [7]:
ratings3.count()

610

### 4. Does the dataset contain missing values?

In [8]:
nulls= ratings.filter(ratings.userId.isNull() | ratings.movieId.isNull() | ratings.rating.isNull() | ratings.timestamp.isNull())
nulls.count()

0

In [9]:
nulls= movies.filter(movies.movieId.isNull() | movies.title.isNull() | movies.genres.isNull())
nulls.count()

0
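
The two filters above spell out every column by hand. The same check can be expressed once and applied to each column. The plain-Python sketch below illustrates the idea on a small made-up CSV snippet (not the MovieLens files); the helper name `count_missing` is hypothetical, and the sketch treats an empty field as a missing value.

```python
import csv
import io

# Made-up sample in the same layout as ratings.csv; the second row lacks a rating
sample = "userId,movieId,rating,timestamp\n1,10,4.0,100\n2,11,,200\n"

def count_missing(csv_text):
    """Return a dict mapping each column name to its number of empty fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value is None or value == "":
                missing[name] += 1
    return missing

missing_counts = count_missing(sample)
print(missing_counts)
```

A result with all zeros corresponds to the `0` returned by the two `nulls.count()` calls above.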

### 5. How many movies have no ratings? Which movies are they?

In [10]:
moviesWithRatings = movies.join(other=ratings, on='movieId', how='left')
NoRatings=moviesWithRatings.filter(moviesWithRatings.rating.isNull())
NoRatings.show()
NoRatings.count()

+-------+--------------------+--------------------+------+------+---------+
|movieId|               title|              genres|userId|rating|timestamp|
+-------+--------------------+--------------------+------+------+---------+
|   1076|Innocents, The (1...|Drama|Horror|Thri...|  null|  null|     null|
|   2939|      Niagara (1953)|      Drama|Thriller|  null|  null|     null|
|   3338|For All Mankind (...|         Documentary|  null|  null|     null|
|   3456|Color of Paradise...|               Drama|  null|  null|     null|
|   4194|I Know Where I'm ...|   Drama|Romance|War|  null|  null|     null|
|   5721|  Chosen, The (1981)|               Drama|  null|  null|     null|
|   6668|Road Home, The (W...|       Drama|Romance|  null|  null|     null|
|   6849|      Scrooge (1970)|Drama|Fantasy|Mus...|  null|  null|     null|
|   7020|        Proof (1991)|Comedy|Drama|Romance|  null|  null|     null|
|   7792|Parallax View, Th...|            Thriller|  null|  null|     null|
|   8765|Thi

18

### 6. Which movie has the best average rating? If there are several, give the one with the most votes.

In [11]:
import pyspark.sql.functions as f

In [12]:
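# Count ratings rather than titles, so movies without ratings (null after the left join) get 0 votes
moviesWithRatings.groupBy('title').agg(f.avg('rating').alias('AvgRating'),
    f.count('rating').alias('NumberOfVotes')).orderBy(f.desc('AvgRating'), f.desc('NumberOfVotes')).show()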

+--------------------+---------+-------------+
|               title|AvgRating|NumberOfVotes|
+--------------------+---------+-------------+
|Enter the Void (2...|      5.0|            2|
| Lesson Faust (1994)|      5.0|            2|
|Jonah Who Will Be...|      5.0|            2|
| Belle époque (1992)|      5.0|            2|
|Heidi Fleiss: Hol...|      5.0|            2|
|     Lamerica (1994)|      5.0|            2|
|Come and See (Idi...|      5.0|            2|
|George Carlin: Yo...|      5.0|            1|
|Vacations in Pros...|      5.0|            1|
|Tickling Giants (...|      5.0|            1|
|English Vinglish ...|      5.0|            1|
|      Villain (1971)|      5.0|            1|
|Winnie the Pooh a...|      5.0|            1|
|Paper Birds (Pája...|      5.0|            1|
|Awfully Big Adven...|      5.0|            1|
|         Rain (2001)|      5.0|            1|
|    Radio Day (2008)|      5.0|            1|
|National Lampoon'...|      5.0|            1|
|Martin Lawre

### 7. What percentage of movies have only maximum ratings?

In [13]:
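# rating was read as a string, so cast it to double before aggregating and comparing
bestRatings = moviesWithRatings.groupBy('title').agg(f.min(f.col('rating').cast('double')).alias('Min'))
# A minimum of 5.0 means that every rating of the movie is the maximum
only5 = bestRatings.where(bestRatings['Min'] == 5.0)
only5.count() / movies.count() * 100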

3.03839047423527
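
As a quick sanity check, the percentage can be converted back into a number of titles. The sketch below assumes that the denominator is the 9742 movies reported by `movies.describe()` earlier in the notebook.

```python
percentage = 3.03839047423527   # value printed by the cell above
total_movies = 9742             # row count of movies, taken from the describe() output

# Invert percentage = count / total * 100 to recover the number of titles
only_max_count = round(percentage / 100 * total_movies)
print(only_max_count)
```

Under that assumption, roughly 3% corresponds to 296 titles whose ratings are all 5.0.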

### 8. Which movie has the highest minimum rating? If there are several, give the one with the most votes.

In [14]:
# Count ratings rather than titles, so movies without ratings are not credited with one vote
moviesWithRatings.groupBy('title').agg(f.min('rating').alias('Min'),
    f.count('rating').alias('NumberOfVotes')).orderBy(f.desc('Min'), f.desc('NumberOfVotes')).show()

+--------------------+---+-------------+
|               title|Min|NumberOfVotes|
+--------------------+---+-------------+
|     Lamerica (1994)|5.0|            2|
|Enter the Void (2...|5.0|            2|
| Lesson Faust (1994)|5.0|            2|
|Heidi Fleiss: Hol...|5.0|            2|
| Belle époque (1992)|5.0|            2|
|Come and See (Idi...|5.0|            2|
|Jonah Who Will Be...|5.0|            2|
|Adventures Of She...|5.0|            1|
|Alien Contaminati...|5.0|            1|
|7 Faces of Dr. La...|5.0|            1|
| 'Salem's Lot (2004)|5.0|            1|
|Battle Royale 2: ...|5.0|            1|
|All the Vermeers ...|5.0|            1|
|    12 Chairs (1976)|5.0|            1|
|Animals are Beaut...|5.0|            1|
|A Detective Story...|5.0|            1|
|A Perfect Day (2015)|5.0|            1|
|    All Yours (2016)|5.0|            1|
|American Friend, ...|5.0|            1|
|Assignment, The (...|5.0|            1|
+--------------------+---+-------------+
only showing top

### 9. What is the distribution of ratings?

In [15]:
ratings.groupBy('rating').count().orderBy("rating").show()

+------+-----+
|rating|count|
+------+-----+
|   0.5| 1370|
|   1.0| 2811|
|   1.5| 1791|
|   2.0| 7551|
|   2.5| 5550|
|   3.0|20047|
|   3.5|13136|
|   4.0|26818|
|   4.5| 8551|
|   5.0|13211|
+------+-----+
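
The raw counts are easier to interpret as shares of all ratings. A minimal plain-Python sketch, with the counts copied from the table above, that derives the total number of ratings, the mean rating, and the percentage of each rating value:

```python
# Counts copied from the table above (rating -> number of ratings)
counts = {0.5: 1370, 1.0: 2811, 1.5: 1791, 2.0: 7551, 2.5: 5550,
          3.0: 20047, 3.5: 13136, 4.0: 26818, 4.5: 8551, 5.0: 13211}

total = sum(counts.values())
mean_rating = sum(r * c for r, c in counts.items()) / total

# Share of each rating value as a percentage of all ratings
shares = {r: 100 * c / total for r, c in counts.items()}

print(total)
print(round(mean_rating, 2))
for rating, share in shares.items():
    print(f"{rating}: {share:.1f}%")
```

The distribution is skewed toward high values: 4.0 is the most common rating, and whole-number ratings are more frequent than the neighbouring half-step ratings.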



### 10. How many movies are classified as documentaries ('Documentary')?

In [16]:
documentary=movies.where(movies.genres.like('%Documentary%'))
documentary.show()

+-------+--------------------+-----------------+
|movieId|               title|           genres|
+-------+--------------------+-----------------+
|     77|    Nico Icon (1995)|      Documentary|
|     99|Heidi Fleiss: Hol...|      Documentary|
|    108|      Catwalk (1996)|      Documentary|
|    116|Anne Frank Rememb...|      Documentary|
|    128|Jupiter's Wife (1...|      Documentary|
|    137|Man of the Year (...|      Documentary|
|    162|        Crumb (1994)|      Documentary|
|    206|     Unzipped (1995)|      Documentary|
|    246|  Hoop Dreams (1994)|      Documentary|
|    363|Wonderful, Horrib...|      Documentary|
|    556|War Room, The (1993)|      Documentary|
|    581|Celluloid Closet,...|      Documentary|
|    602|Great Day in Harl...|      Documentary|
|    722|Haunted World of ...|      Documentary|
|    759|Maya Lin: A Stron...|      Documentary|
|    791|Last Klezmer: Leo...|      Documentary|
|   1050|Looking for Richa...|Documentary|Drama|
|   1111|Microcosmos

In [17]:
documentary.count()

440

### 11. Which documentary with at least 10 votes has the highest average rating?

In [18]:
# Filter on the joined DataFrame's own genres column
doc = moviesWithRatings.where(moviesWithRatings.genres.like('%Documentary%'))
doc2 = doc.groupBy('title').agg(f.avg('rating').alias('AvgRating'),
    f.count('rating').alias('NumberOfVotes')).orderBy(f.desc('AvgRating'), f.desc('NumberOfVotes'))
# "At least 10 votes" means NumberOfVotes >= 10
doc2.where(doc2['NumberOfVotes'] >= 10).show(1)

+--------------------+------------------+-------------+
|               title|         AvgRating|NumberOfVotes|
+--------------------+------------------+-------------+
|Fog of War: Eleve...|4.3076923076923075|           13|
+--------------------+------------------+-------------+
only showing top 1 row



### 12. How does the number of ratings in the dataset change from year to year? Assume that timestamp represents the number of seconds since 1960.

In [19]:
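# Convert seconds since 1960 (the exercise's assumption) to a year, using the average Gregorian year length
df12 = ratings.withColumn(colName='year', col=1960 + f.floor(ratings.timestamp / (60*60*24*365.2425)))
# Number of ratings per year, in chronological order
df12a = df12.groupBy('year').count().orderBy('year')
df12a.show()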

+----+-----+
|year|count|
+----+-----+
|1986| 6031|
|1987| 1925|
|1988|  507|
|1989| 2439|
|1990|10061|
|1991| 3922|
|1992| 3478|
|1993| 4014|
|1994| 3274|
|1995| 5818|
|1996| 4059|
|1997| 7114|
|1998| 4345|
|1999| 4163|
|2000| 2302|
|2001| 1690|
|2002| 4656|
|2003| 1664|
|2004| 1439|
|2005| 6616|
+----+-----+
only showing top 20 rows
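
The table above lists the number of ratings per year, while the question asks for the change from one year to the next. A minimal plain-Python sketch of that last step, using the first five rows shown above. The helper name `year_from_timestamp` is hypothetical and mirrors the notebook's conversion under the stated 1960 assumption.

```python
import math

SECONDS_PER_YEAR = 60 * 60 * 24 * 365.2425  # average Gregorian year, as in the cell above

def year_from_timestamp(ts, base_year=1960):
    """Map seconds since base_year to a calendar year (the exercise's assumption)."""
    return base_year + math.floor(ts / SECONDS_PER_YEAR)

# First five rows of the per-year counts shown above
counts_by_year = [(1986, 6031), (1987, 1925), (1988, 507), (1989, 2439), (1990, 10061)]

# Difference between each year's count and the previous year's count
differences = [
    (year, count - prev_count)
    for (_, prev_count), (year, count) in zip(counts_by_year, counts_by_year[1:])
]

print(differences)
```

In Spark, the same step could be expressed with `pyspark.sql.functions.lag` over a `Window` ordered by `year`, subtracting the lagged count from the current one.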

