## Imports

In [None]:
import json
import urllib

import pandas as pd  # needed below for pd.read_json and pd.DataFrame

# HTML parsing
from lxml import html
import requests

## Getting all the Spanish movies

Source: Wikipedia. https://es.wikipedia.org/wiki/Categor%C3%ADa:Pel%C3%ADculas_de_Espa%C3%B1a

Since no existing database listed Spanish films together with their Bechdel test scores, we had to build our own.
To do so, we scraped every Spanish production or co-production that we could find on Wikipedia.
We then used the API offered by https://bechdeltest.com/ (documentation: https://bechdeltest.com/api/v1/doc). Its "getMovieByImdbId" method returned the Bechdel score of 172 films out of the 3,149 scraped from Wikipedia.
That method requires the IMDb ID of each film, which we extracted from each film's Wikipedia page with XPath expressions.

<img src="img/wikipedia_imdbid.png" width=1000 height=800/>
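
The extraction step can be tried offline on a small HTML snippet. The sketch below is not the notebook's exact expression: instead of relying on the position of the list item, it looks for any link that points to an IMDb title page. The sample markup, the `extract_imdb_id` helper, and the assumption that the link target contains `imdb.com/title/tt` are all illustrative and should be checked against a real Wikipedia page.

```python
import re

from lxml import html

# Hypothetical, simplified stand-in for the external-identifier box of a film article.
SAMPLE_HTML = """
<html><body>
<ul>
  <li>Wikidata: <span class="uid"><a href="https://www.wikidata.org/wiki/Q0">Q0</a></span></li>
  <li>IMDb: <span class="uid"><a href="https://www.imdb.com/title/tt0123456">tt0123456</a></span></li>
</ul>
</body></html>
"""

# Captures the digits that follow the "tt" prefix of an IMDb title ID.
IMDB_TITLE_PATTERN = re.compile(r"imdb\.com/title/tt(\d+)")


def extract_imdb_id(page_source):
    """Return the numeric part of the first IMDb title ID linked in the page, or None."""
    tree = html.fromstring(page_source)
    for href in tree.xpath('//a[contains(@href, "imdb.com/title/tt")]/@href'):
        match = IMDB_TITLE_PATTERN.search(href)
        if match:
            return match.group(1)
    return None


print(extract_imdb_id(SAMPLE_HTML))
```

Matching on the link target rather than on `li[5]` is a design choice: it should keep working when a page lists its identifiers in a different order.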

In [None]:
wikipedia_spanish_movies_imdbid_list = []

# TODO: either provide a list with the links to the category pages (200 films per page)
# or keep scraping the link to the next page until the last page is reached

for i in range(1):
    wikipedia_page = requests.get('https://es.wikipedia.org/w/index.php?title=Categor%C3%ADa:Pel%C3%ADculas_de_Espa%C3%B1a&pagefrom=Un+verano+para+matar#mw-pages')
    wikipedia_tree = html.fromstring(wikipedia_page.content)
    wikipedia_links_list = wikipedia_tree.xpath('//div[@class="mw-category-group"]/ul/li/a[@href]')
    # href values already start with "/", so the base URL must not end with one
    wikipedia_links_href_list = [f'https://es.wikipedia.org{link.get("href")}' for link in wikipedia_links_list]
    for movie_link in wikipedia_links_href_list:
        movie_page = requests.get(movie_link)
        movie_tree = html.fromstring(movie_page.content)
        movie_imdbid = movie_tree.xpath('//td[@class="navbox-list navbox-odd"]/div/ul/li[5]/span[@class="uid"]/a/text()')
        if movie_imdbid:
            # Drop the leading "tt" so that only the numeric part of the ID is kept
            movie_imdbid = movie_imdbid[0][2:]
            print(movie_imdbid)
            wikipedia_spanish_movies_imdbid_list.append(movie_imdbid)
        else:
            print('List is empty')
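
The second option mentioned in the comment, following the next-page link until the last page, could look roughly like the sketch below. It assumes that the category page marks the link with the text "página siguiente" inside the `mw-pages` block; the helper names, the `fetch` callable, and the sample pages are hypothetical. Injecting `fetch` keeps the example runnable without network access; in the notebook it could be something like `lambda url: requests.get(url).text`.

```python
from urllib.parse import urljoin

from lxml import html

BASE_URL = "https://es.wikipedia.org"

# Hypothetical, heavily simplified category pages: the first links to the second,
# the second is the last one (its "next page" label is plain text, not a link).
PAGE_1 = (
    '<html><body><div id="mw-pages">'
    '<a href="/w/index.php?title=Cat&amp;pagefrom=B#mw-pages">página siguiente</a>'
    "</div></body></html>"
)
PAGE_2 = '<html><body><div id="mw-pages">página siguiente</div></body></html>'


def find_next_page_url(page_source, base_url=BASE_URL):
    """Return the absolute URL of the next category page, or None on the last page."""
    tree = html.fromstring(page_source)
    hrefs = tree.xpath(
        '//div[@id="mw-pages"]//a[normalize-space()="página siguiente"]/@href'
    )
    return urljoin(base_url, hrefs[0]) if hrefs else None


def iter_category_pages(start_url, fetch):
    """Yield the source of each category page, following next-page links."""
    url = start_url
    while url:
        page_source = fetch(url)
        yield page_source
        url = find_next_page_url(page_source)


# Offline usage: a dictionary lookup plays the role of the HTTP request.
fake_site = {
    "start": PAGE_1,
    "https://es.wikipedia.org/w/index.php?title=Cat&pagefrom=B#mw-pages": PAGE_2,
}
print(len(list(iter_category_pages("start", fake_site.__getitem__))))
```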

## Getting Spanish movies that already have a Bechdel score

As mentioned above, once the IMDb IDs of the films produced in Spain have been obtained, we query the Bechdel test API for each one.
Because these are Spanish films, many of them have not been rated and are missing from that database, so we have to handle the cases in which the API finds no match.
When the film is in the database, the API returns a JSON object, as shown in the following image.

<img src="img/json_example_bechdel.png" width=300 height=300/>
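
A minimal sketch of how the two kinds of response can be told apart, using only the standard library. The payloads are hypothetical and reduced to a few fields; the only detail taken from this notebook is that a film missing from the database comes back with the description "Could not find movie".

```python
import json

NOT_FOUND_MESSAGE = "Could not find movie"


def keep_rated_movies(records):
    """Keep only the records that describe a film found in the Bechdel database."""
    return [
        record for record in records
        if record.get("description") != NOT_FOUND_MESSAGE
    ]


# Hypothetical response bodies (illustrative values, not real API output).
raw_responses = [
    '{"imdbid": "0123456", "title": "Example Film", "year": 2000, "rating": 3}',
    '{"description": "Could not find movie"}',
]

records = [json.loads(body) for body in raw_responses]
rated_movies = keep_rated_movies(records)
print([movie["title"] for movie in rated_movies])
```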

Once the list of JSON objects is complete, we convert it to a data frame.
We first collect all the elements in a list and only then build the data frame, rather than adding each JSON object as a new row, because appending rows to a data frame takes much longer than appending elements to a list. The following graph shows the difference.

<img src="img/runtime_list_df_append.png" width=600 height=600/>
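
The comparison can be reproduced on a small scale with the sketch below. It is not the benchmark behind the graph: `DataFrame.append` was removed in pandas 2.0, so row-by-row growth is emulated here with `pd.concat`, and the records are synthetic. The function names and the number of records are arbitrary choices.

```python
import timeit

import pandas as pd


def build_from_list(records):
    """Collect the rows in a plain list and build the data frame once."""
    rows = []
    for record in records:
        rows.append(record)
    return pd.DataFrame(rows)


def build_by_concat(records):
    """Grow the data frame one row at a time (expects at least one record)."""
    frame = pd.DataFrame([records[0]])
    for record in records[1:]:
        frame = pd.concat([frame, pd.DataFrame([record])], ignore_index=True)
    return frame


# Synthetic records standing in for the JSON objects returned by the API.
records = [{"imdbid": str(i), "rating": i % 4} for i in range(200)]

list_time = timeit.timeit(lambda: build_from_list(records), number=3)
concat_time = timeit.timeit(lambda: build_by_concat(records), number=3)
print(f"list then DataFrame: {list_time:.4f} s, row-by-row concat: {concat_time:.4f} s")
```

Each `pd.concat` call copies the rows accumulated so far, so the cost of the row-by-row approach grows with the size of the frame, whereas appending to a list does not copy the existing elements.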

In [None]:
# Collect one pandas Series per film; the data frame is built once at the end
df_wikipedia_spanish_movies_list = []
for wikipedia_spanish_movie_imdbid in wikipedia_spanish_movies_imdbid_list:
    http_request = f'http://bechdeltest.com/api/v1/getMovieByImdbId?imdbid={wikipedia_spanish_movie_imdbid}'
    # pd.read_json fetches the URL and parses the returned JSON object into a Series
    new_wikipedia_movie = pd.read_json(http_request, typ="series")
    df_wikipedia_spanish_movies_list.append(new_wikipedia_movie)

df_wikipedia_spanish_movies = pd.DataFrame(df_wikipedia_spanish_movies_list)

The resulting data frame still contains the films that were not found in the Bechdel test database. These rows are easy to discard because their "description" column contains the string "Could not find movie".

In [None]:
# Rows for films that the Bechdel test API could not find
not_found = df_wikipedia_spanish_movies.description == "Could not find movie"
# reset_index() keeps the old index as an "index" column, which is written to the CSV below
df_wikipedia_spanish_movies = df_wikipedia_spanish_movies[~not_found].reset_index()

Finally, the data frame information is saved in a CSV file.

In [None]:
# mode='a' appends to the file if it already exists; header=False omits the column names
bechdel_columns = ["index", "status", "version", "description", "date", "dubious", "year", "visible", "rating", "title", "submitterid", "id", "imdbid"]
df_wikipedia_spanish_movies.to_csv("wikipedia_spanish_movies_bechdel_new.csv", mode='a', columns=bechdel_columns, header=False, index=False)

## Getting the IMDb rating of Spanish movies

One of the planned analyses needs the IMDb rating of each film, so we scrape it as well, this time from the IMDb website. An f-string builds the URL of each film's IMDb page from its IMDb ID, and the rating is then extracted from that page.

<img src="img/url_rating_imdb.png" width=1000 height=800/>

<img src="img/html_rating_imdb.png" width=1000 height=800/>

In [None]:
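wikipedia_spanish_movies_imdb_id_rating = {}
# Assumes df_spanish_movies_bechdel holds the Bechdel data frame built above (for example, reloaded from the CSV file)
imdbid_list = df_spanish_movies_bechdel["imdbid"].to_list()
for imdbid in imdbid_list:
    imdb_movie_page = requests.get(f'https://www.imdb.com/title/tt{imdbid}/?ref_=fn_tt_tt_1')
    imdb_movie_tree = html.fromstring(imdb_movie_page.content)
    # The class name below comes from the IMDb page markup at the time of scraping
    rating_movie = imdb_movie_tree.xpath('(//span[@class="sc-7ab21ed2-1 jGRxWM"])[1]/text()')
    if rating_movie:
        wikipedia_spanish_movies_imdb_id_rating[imdbid] = rating_movie[0]
    else:
        # Store None explicitly so that films without a rating stay in the dictionary
        print(f'No rating for {imdbid}')
        wikipedia_spanish_movies_imdb_id_rating[imdbid] = None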

In [None]:
spanish_movies_ratings_dict = {"imdbid": list(wikipedia_spanish_movies_imdb_id_rating.keys()), "rating": list(wikipedia_spanish_movies_imdb_id_rating.values())}
spanish_movies_ratings_df = pd.DataFrame.from_dict(spanish_movies_ratings_dict)
spanish_movies_ratings_df.to_csv("spanish_movies_ratings.csv", index=False)
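
The class name used in the rating XPath above looks auto-generated, so it may stop matching when IMDb changes its markup. A possible alternative, sketched below, reads the rating from a JSON-LD `<script>` block instead. This is a different technique from the one used in this notebook, and it rests on the assumption that the title page embeds an `application/ld+json` block with an `aggregateRating.ratingValue` field, which should be verified against a live page. The sample HTML and the helper name are hypothetical.

```python
import json

from lxml import html

# Hypothetical, minimal stand-in for an IMDb title page with embedded JSON-LD metadata.
SAMPLE_IMDB_HTML = """
<html><head>
<script type="application/ld+json">
{"@type": "Movie", "name": "Example Film",
 "aggregateRating": {"@type": "AggregateRating", "ratingValue": 7.4}}
</script>
</head><body></body></html>
"""


def extract_rating_from_json_ld(page_source):
    """Return the rating found in the page's JSON-LD metadata as a float, or None."""
    tree = html.fromstring(page_source)
    for script_text in tree.xpath('//script[@type="application/ld+json"]/text()'):
        try:
            data = json.loads(script_text)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if not isinstance(data, dict):
            continue
        aggregate = data.get("aggregateRating")
        if isinstance(aggregate, dict) and aggregate.get("ratingValue") is not None:
            return float(aggregate["ratingValue"])
    return None


print(extract_rating_from_json_ld(SAMPLE_IMDB_HTML))
```

Returning a float rather than the scraped string also makes the resulting column directly usable in numeric analyses.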