# Exercise 1: The Movie Database analysis
The file `data/tmdb_movies.json` contains metadata for thousands of movies extracted from The Movie Database (https://www.themoviedb.org/). The JSON file consists of a list of movies, each with the following fields:

```
    {
        budget: Cost (in $),
        genres: A list of movie genres,
        homepage: Movie homepage URL,
        id: Movie ID in this database,
        keywords: Keywords associated with the movie,
        original_language: Language code representing the language in which the movie was created,
        original_title: Original movie title,
        overview: Brief description of the movie,
        popularity: Popularity score,
        production_companies: List of companies that produced the movie,
        production_countries: List of countries in which the movie was produced,
        release_date: Movie release date,
        revenue: Revenue (in $),
        runtime: Movie length (in minutes),
        spoken_languages: Languages spoken in the movie,
        status: Release status,
        tagline: Tagline,
        title: Title,
        vote_average: Average voting score,
        vote_count: Number of votes
    }
```

Answer the following questions:
1. Import the JSON file and count the number of movies.
2. Add a new key called `year` to each movie dict and store the release year in it.
3. Write a function that prints the following information for each movie: `{year}: {original_title}`
```python
    def print_movies(movies: list)
```
4. Write a function that, given a year, returns the movies released in or after that year. How many movies have been released since 2015 (inclusive)?
```python
    def movies_by_year(movies: list, year: int) -> list
```

5. Write a function that, given a list of keywords, returns all movies tagged with those keywords.
```python
    def movies_by_keywords(movies: list, keywords: set) -> list
```

6. In how many movies did `Twentieth Century Fox Film Corporation` participate as a producer?
7. Get the average rating of all `Fantasy` movies released since 2015 (inclusive).
8. What is the most expensive `space_travel` movie of all time?
9. In which country have the most `romantic` movies been produced?

In [3]:
import json
with open("data/tmdb_movies.json") as file:
    movies = json.load(file)


In [4]:
#1
print(f"This file has {len(movies)} movies")

This file has 4800 movies


In [5]:
#2
for movie in movies:
    year = movie["release_date"][:4]
    movie["year"] = year
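
The slice above assumes every `release_date` is a non-empty string that starts with a four-digit year. As a hedged sketch of a more defensive variant, the hypothetical helper `release_year` below returns an `int`, or `None` when the date is missing or malformed. The two sample records are made up for illustration and are not taken from the dataset.

```python
from typing import Optional


def release_year(movie: dict) -> Optional[int]:
    """Return the release year as an int, or None if the date is missing or malformed."""
    date = movie.get("release_date") or ""
    prefix = date[:4]
    # Only convert when the first four characters are all decimal digits.
    return int(prefix) if len(prefix) == 4 and prefix.isdecimal() else None


# Illustrative records (assumed, not real dataset rows).
sample = [
    {"original_title": "Avatar", "release_date": "2009-12-10"},
    {"original_title": "Untitled", "release_date": ""},
]

for movie in sample:
    movie["year"] = release_year(movie)
```

Note that this stores the year as an integer, whereas the cell above stores a string, so any later filter would have to skip movies whose `year` is `None`.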


In [6]:
#3
def print_movies(movies):
    for movie in movies:
        print(f"Year: {movie['year']} | Movie: {movie['original_title']}")
        
print_movies(movies)

Year: 2009 | Movie: Avatar
Year: 2007 | Movie: Pirates of the Caribbean: At World's End
Year: 2015 | Movie: Spectre
Year: 2012 | Movie: The Dark Knight Rises
Year: 2012 | Movie: John Carter
Year: 2007 | Movie: Spider-Man 3
Year: 2010 | Movie: Tangled
Year: 2015 | Movie: Avengers: Age of Ultron
Year: 2009 | Movie: Harry Potter and the Half-Blood Prince
Year: 2016 | Movie: Batman v Superman: Dawn of Justice
Year: 2006 | Movie: Superman Returns
Year: 2008 | Movie: Quantum of Solace
Year: 2006 | Movie: Pirates of the Caribbean: Dead Man's Chest
Year: 2013 | Movie: The Lone Ranger
Year: 2013 | Movie: Man of Steel
Year: 2008 | Movie: The Chronicles of Narnia: Prince Caspian
Year: 2012 | Movie: The Avengers
Year: 2011 | Movie: Pirates of the Caribbean: On Stranger Tides
Year: 2012 | Movie: Men in Black 3
Year: 2014 | Movie: The Hobbit: The Battle of the Five Armies
Year: 2012 | Movie: The Amazing Spider-Man
Year: 2010 | Movie: Robin Hood
Year: 2013 | Movie: The Hobbit: The Desolation of Smaug

In [7]:
#4
def movies_by_year(movies, year):
    L = []
    for movie in movies:
        if int(movie["year"]) >= year:
            L.append(movie)
    return L

print_movies(movies_by_year(movies, 2015))

Year: 2015 | Movie: Spectre
Year: 2015 | Movie: Avengers: Age of Ultron
Year: 2016 | Movie: Batman v Superman: Dawn of Justice
Year: 2016 | Movie: Captain America: Civil War
Year: 2015 | Movie: Jurassic World
Year: 2015 | Movie: Furious 7
Year: 2015 | Movie: The Good Dinosaur
Year: 2016 | Movie: Star Trek Beyond
Year: 2015 | Movie: Jupiter Ascending
Year: 2016 | Movie: The Legend of Tarzan
Year: 2016 | Movie: X-Men: Apocalypse
Year: 2016 | Movie: Suicide Squad
Year: 2015 | Movie: Inside Out
Year: 2016 | Movie: The Jungle Book
Year: 2015 | Movie: The Lovers
Year: 2015 | Movie: Tomorrowland
Year: 2016 | Movie: Independence Day: Resurgence
Year: 2016 | Movie: シン・ゴジラ
Year: 2015 | Movie: The Hunger Games: Mockingjay - Part 2
Year: 2016 | Movie: Alice Through the Looking Glass
Year: 2016 | Movie: Warcraft
Year: 2015 | Movie: Terminator Genisys
Year: 2015 | Movie: Mad Max: Fury Road
Year: 2015 | Movie: Mission: Impossible - Rogue Nation
Year: 2015 | Movie: Pan
Year: 2016 | Movie: Ghostbusters
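
The cell above lists the matching movies but does not say how many there are. A minimal sketch of the count follows, with the filter rewritten as a list comprehension and run on made-up records (the titles and years are illustrative assumptions, not dataset values):

```python
def movies_by_year(movies: list, year: int) -> list:
    """Return the movies released in or after the given year."""
    return [movie for movie in movies if int(movie["year"]) >= year]


# Illustrative records (assumed); "year" is a string, as in the notebook.
sample = [
    {"original_title": "A", "year": "2014"},
    {"original_title": "B", "year": "2015"},
    {"original_title": "C", "year": "2016"},
]

recent = movies_by_year(sample, 2015)
print(f"{len(recent)} movies released in or after 2015")
```

With the real data, `len(movies_by_year(movies, 2015))` gives the number the question asks for.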

In [8]:
#5
def movies_by_keywords(movies, keywords):
    L = []
    keywords = set(keywords)
    for movie in movies:
        kws = [m["name"] for m in movie["keywords"]]
        kws = set(kws)
        
        if keywords.issubset(kws):
            L.append(movie)
        
    return L

print_movies(movies_by_keywords(movies, ["space war"]))

Year: 2009 | Movie: Avatar
Year: 1984 | Movie: The Ice Pirates
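
`issubset` keeps only the movies that carry every requested keyword. If the question is read as "at least one of the keywords", a possible variant uses `isdisjoint` instead. The name `movies_by_any_keyword` and the sample records are assumptions made for illustration.

```python
def movies_by_any_keyword(movies: list, keywords: set) -> list:
    """Return the movies tagged with at least one of the given keywords."""
    wanted = set(keywords)
    result = []
    for movie in movies:
        names = {kw["name"] for kw in movie["keywords"]}
        # isdisjoint is True when the two sets share no element.
        if not wanted.isdisjoint(names):
            result.append(movie)
    return result


# Illustrative records (assumed), using the same nested layout as the dataset.
sample = [
    {"original_title": "A", "keywords": [{"id": 1, "name": "space war"}, {"id": 2, "name": "alien"}]},
    {"original_title": "B", "keywords": [{"id": 3, "name": "romance"}]},
    {"original_title": "C", "keywords": []},
]

matches = movies_by_any_keyword(sample, {"space war", "romance"})
titles = [movie["original_title"] for movie in matches]
```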


In [11]:
#6
def movies_by_production_company(movies, company):
    L = []
    for movie in movies:
        production_companies = set([m["name"] for m in movie['production_companies']])
        
        if company in production_companies:
            L.append(movie)
    
    return L

print_movies(movies_by_production_company(movies, "Twentieth Century Fox Film Corporation"))
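
The question asks for a number, while the cell above prints the titles. One way to get the count directly is sketched below; `count_by_company` is a hypothetical helper and the sample records are invented for the example.

```python
def count_by_company(movies: list, company: str) -> int:
    """Count the movies that list the given company among their producers."""
    return sum(
        1
        for movie in movies
        if any(c["name"] == company for c in movie["production_companies"])
    )


# Illustrative records (assumed), mirroring the dataset's nested layout.
sample = [
    {"original_title": "A", "production_companies": [{"name": "Twentieth Century Fox Film Corporation", "id": 1}]},
    {"original_title": "B", "production_companies": [{"name": "Some Other Studio", "id": 2}]},
    {"original_title": "C", "production_companies": []},
]

fox_count = count_by_company(sample, "Twentieth Century Fox Film Corporation")
print(f"Movies produced: {fox_count}")
```

With the notebook's own function, `len(movies_by_production_company(movies, "Twentieth Century Fox Film Corporation"))` gives the same kind of answer.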

In [33]:
#7
def movies_by_genre(movies, genre):
    L = []
    for movie in movies:
        movie_genre = set([m["name"] for m in movie["genres"]])
        
        if genre in movie_genre:
            L.append(movie)
        
    return L


# Filter once and reuse the result for both printing and averaging
my_movies = movies_by_year(movies_by_genre(movies, "Fantasy"), 2015)
print_movies(my_movies)

ratings = [m["vote_average"] for m in my_movies]
avg_rating = sum(ratings)/len(ratings)
print(f"The average rating for Fantasy movies released after 2015 is {round(avg_rating, 2)}")


Year: 2016 | Movie: Batman v Superman: Dawn of Justice
Year: 2015 | Movie: Jupiter Ascending
Year: 2016 | Movie: Suicide Squad
Year: 2016 | Movie: The Jungle Book
Year: 2016 | Movie: Alice Through the Looking Glass
Year: 2016 | Movie: Warcraft
Year: 2015 | Movie: Pan
Year: 2016 | Movie: Ghostbusters
Year: 2016 | Movie: Gods of Egypt
Year: 2016 | Movie: The BFG
Year: 2015 | Movie: Home
Year: 2016 | Movie: Teenage Mutant Ninja Turtles: Out of the Shadows
Year: 2015 | Movie: Cinderella
Year: 2015 | Movie: The Last Witch Hunter
Year: 2015 | Movie: The Little Prince
Year: 2016 | Movie: 西游记之孙悟空三打白骨精
Year: 2015 | Movie: The Age of Adaline
Year: 2015 | Movie: Савва. Сердце воина
Year: 2016 | Movie: Sausage Party
Year: 2015 | Movie: Krampus
Year: 2016 | Movie: Pete's Dragon
Year: 2016 | Movie: Yoga Hosers
Year: 2015 | Movie: Do You Believe?


In [51]:
#8
st_movies = movies_by_keywords(movies, ["space travel"])

most_expensive_st_movie = sorted(st_movies, key=lambda x: x["budget"], reverse = True)[0]

print(most_expensive_st_movie["original_title"])
movies[0]

John Carter


{'budget': 237000000,
 'genres': [{'id': 28, 'name': 'Action'},
  {'id': 12, 'name': 'Adventure'},
  {'id': 14, 'name': 'Fantasy'},
  {'id': 878, 'name': 'Science Fiction'}],
 'homepage': 'http://www.avatarmovie.com/',
 'id': 19995,
 'keywords': [{'id': 1463, 'name': 'culture clash'},
  {'id': 2964, 'name': 'future'},
  {'id': 3386, 'name': 'space war'},
  {'id': 3388, 'name': 'space colony'},
  {'id': 3679, 'name': 'society'},
  {'id': 3801, 'name': 'space travel'},
  {'id': 9685, 'name': 'futuristic'},
  {'id': 9840, 'name': 'romance'},
  {'id': 9882, 'name': 'space'},
  {'id': 9951, 'name': 'alien'},
  {'id': 10148, 'name': 'tribe'},
  {'id': 10158, 'name': 'alien planet'},
  {'id': 10987, 'name': 'cgi'},
  {'id': 11399, 'name': 'marine'},
  {'id': 13065, 'name': 'soldier'},
  {'id': 14643, 'name': 'battle'},
  {'id': 14720, 'name': 'love affair'},
  {'id': 165431, 'name': 'anti war'},
  {'id': 193554, 'name': 'power relations'},
  {'id': 206690, 'name': 'mind and soul'},
  {'id': 2

In [62]:
#9 
romantic_movies = movies_by_keywords(movies, ["romance"])

def movies_by_country(romantic_movies):
    L = []
    for movie in romantic_movies:
        movie_country = [m["name"] for m in movie["production_countries"]]
        L.extend(movie_country)
        
    return L

country_list = movies_by_country(romantic_movies)

unique_countries = set(country_list)

for country in unique_countries:
    count = country_list.count(country)
    print(f"{country}: {count}")


Italy: 2
South Africa: 1
United Kingdom: 4
Czech Republic: 1
United States of America: 18
France: 3
Spain: 1
Germany: 2


# Exercise 2

We will work with Twitter data again.

 1. Using the `os` module, create a folder called `TWEETS` in the root directory
 2. For each tweet in `twitter_data.json` create a subfolder inside `TWEETS` with the tweet creation date as the folder name
 3. Save each tweet (in JSON format) in its corresponding folder according to the creation date. Use the tweet ID as the file name.

_NOTE: You cannot use the file system explorer to create folders, you have to use the `os` module instead_

In [65]:
import os
#1
# exist_ok=True avoids a FileExistsError when the cell is run more than once
os.makedirs("TWEETS", exist_ok=True)

In [71]:
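#2 and #3
with open("data/twitter_data.json") as file:
    twitter = json.load(file)

# The input file is closed before writing; the loop only needs the parsed list
for tw in twitter:
    # The first 10 characters of "created_at" are used as the folder name
    creation_date = tw["created_at"][:10]
    path = os.path.join("TWEETS", creation_date)
    os.makedirs(path, exist_ok=True)
    name = tw["id"]
    filepath = os.path.join(path, f"{name}.json")
    # A separate name for the output handle avoids shadowing the input one
    with open(filepath, "w") as out_file:
        json.dump(tw, out_file)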
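
Slicing `created_at[:10]` gives folder names such as `Wed Oct 10` if the file uses Twitter's classic timestamp layout (for example `Wed Oct 10 20:19:24 +0000 2018`). That layout is an assumption here, since the file contents are not shown. Under that assumption, the hypothetical helper `date_folder_name` below sketches how to produce ISO dates instead, which sort chronologically:

```python
from datetime import datetime

# Assumed layout of "created_at", e.g. "Wed Oct 10 20:19:24 +0000 2018".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def date_folder_name(created_at: str) -> str:
    """Convert a Twitter-style timestamp into a YYYY-MM-DD folder name."""
    return datetime.strptime(created_at, TWITTER_FORMAT).date().isoformat()


folder = date_folder_name("Wed Oct 10 20:19:24 +0000 2018")
print(folder)  # prints 2018-10-10
```

The result could replace `creation_date` in the loop above. If the file stores dates in another layout, `strptime` raises a `ValueError` and the format string has to be adjusted.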
