In [1]:
import requests
from bs4 import BeautifulSoup

This project uses two packages: requests and BeautifulSoup. <br>
requests sends a GET request to the specified URL. <br>
The response body, decoded to Unicode text, is then passed to BeautifulSoup, which makes it possible to pull data out of the HTML. <br>

From each movie page, the movie name, genre(s) and description are collected. <br>
All movie pages on https://tvprofil.com/ share the same structure. Inspecting the HTML shows that:
* the genre is the content of a span tag with the itemprop attribute 'genre'
* the description is the content of a p tag with the itemprop attribute 'description'
* the movie name is the content of an h1 tag with the itemprop attribute 'name'
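
The three lookups listed above can be tried offline on a small HTML fragment. This is only a sketch: `SAMPLE_HTML` is a hypothetical fragment written to imitate the structure described here, not markup copied from tvprofil.com, and the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating the structure described above;
# it is not copied from tvprofil.com.
SAMPLE_HTML = """
<h1 itemprop="name">Example Movie</h1>
<span itemprop="genre">drama</span>
<span itemprop="genre">komedija</span>
<p itemprop="description">A short example description.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, features='html.parser')

# A single h1 and p tag are expected, so find() returns the first match.
name = soup.find('h1', attrs={'itemprop': 'name'}).text
desc = soup.find('p', attrs={'itemprop': 'description'}).text
# A movie can have several genres, so every matching span is collected.
genres = [s.text for s in soup.find_all('span', attrs={'itemprop': 'genre'})]

print(name)
print(genres)
print(desc)
```

On a real page, `find()` returns `None` when a tag is missing, so production code would probably want to check for that before reading `.text`.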

In [2]:
def get_data_from_page(url: str) -> dict:
    """
    Fetch a movie page and return its name, genre(s) and description.

    Returns an empty dictionary if the response status is not 200.
    """
    response = requests.get(url)
    if response.status_code != 200:
        return {}

    soup = BeautifulSoup(response.text, features='html.parser')
    # A movie can have several genres, one per span tag.
    genre = [s.text for s in soup.find_all('span', attrs={'itemprop': 'genre'})]
    # The description and the name are read from the first matching tag.
    desc = soup.find_all('p', attrs={'itemprop': 'description'})[0].text
    name = soup.find_all('h1', attrs={'itemprop': 'name'})[0].text
    return {'name': name, 'genre': genre, 'desc': desc}

Movie URLs are listed on the pages https://tvprofil.com/film/?page=XX&packages=0, where XX is the page number. <br>
The first page has a slightly different URL: https://tvprofil.com/film/?&packages=0. <br>
Each page lists 15 movie URLs, available as the href of a tags with the itemprop attribute 'url name'. <br>
The code below goes through all pages that list movie URLs and collects the data from each movie URL with the function defined above.
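
The two URL formats described above can be wrapped in one small helper, so that page 1 and the later pages are generated the same way. This is a minimal sketch, not the notebook's own code: `listing_url`, `BASE_URL`, `LAST_PAGE` and `MOVIES_PER_PAGE` are hypothetical names, and the value 81 is an assumption taken from the `range(2, 82)` loop used in this notebook.

```python
BASE_URL = 'https://tvprofil.com/film/'
LAST_PAGE = 81        # assumption: last page visited by range(2, 82)
MOVIES_PER_PAGE = 15  # movie links per listing page, as stated above


def listing_url(page: int) -> str:
    """Return the listing URL for a 1-based page number."""
    if page == 1:
        # The first page has no page parameter.
        return BASE_URL + '?&packages=0'
    return BASE_URL + '?page=' + str(page) + '&packages=0'


listing_urls = [listing_url(page) for page in range(1, LAST_PAGE + 1)]
# Upper bound on collected movies, assuming every page is full.
expected_movies = LAST_PAGE * MOVIES_PER_PAGE

print(listing_urls[0])
print(listing_urls[1])
print(len(listing_urls), expected_movies)
```

Comparing `expected_movies` with the number of collected rows gives a quick sanity check that no page or movie was skipped.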

In [3]:
pages_with_not_ok_response = []
movie_urls = []
movies_with_not_ok_response = []
rows = []

In [4]:
# The first listing page has its own URL format, so it is handled separately.
first_url = 'https://tvprofil.com/film/?&packages=0'
first_response = requests.get(first_url)
if first_response.status_code == 200:
    first_soup = BeautifulSoup(first_response.text, features='html.parser')
    first_mup = first_soup.find_all('a', attrs={'itemprop': 'url name'})
    for fup in first_mup:
        # The href is prefixed with the domain to form the full movie URL.
        movie_url = 'https://tvprofil.com' + fup['href']
        data_dict = get_data_from_page(movie_url)
        if len(data_dict) > 0:
            rows.append(data_dict)
        else:
            movies_with_not_ok_response.append(movie_url)
else:
    pages_with_not_ok_response.append(1)

In [5]:
# Pages 2 to 81 use the ?page=XX URL format.
for page in range(2, 82):
    url = 'https://tvprofil.com/film/?page=' + str(page) + '&packages=0'
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, features='html.parser')
        mup = soup.find_all('a', attrs={'itemprop': 'url name'})
        for up in mup:
            # The href is prefixed with the domain to form the full movie URL.
            movie_url = 'https://tvprofil.com' + up['href']
            data_dict = get_data_from_page(movie_url)
            if len(data_dict) > 0:
                rows.append(data_dict)
            else:
                movies_with_not_ok_response.append(movie_url)
    else:
        pages_with_not_ok_response.append(page)

In [6]:
print('number of rows:', len(rows))
print('movies_with_not_ok_response: ', movies_with_not_ok_response)
print('pages_with_not_ok_response: ', pages_with_not_ok_response)

number of rows: 1215
movies_with_not_ok_response:  []
pages_with_not_ok_response:  []


In [9]:
import pandas as pd

In [10]:
df=pd.DataFrame(rows)
df.head()

| | name | genre | desc |
|---|---|---|---|
| 0 | Ubojice | [akcija, komedija, triler, romantika] | Jen (K. Heigl) je lijepa, privlačna i napušten... |
| 1 | Malci | [animirani, akcija, komedija, obiteljski, krim... | Pridružite se zaštitničkom vođi Kevinu, mladom... |
| 2 | La Llorona | [triler, horor, drama, dokumentarni, ratni, kr... | Alma je ubijena sa svojom djecom tijekom vojno... |
| 3 | Kostim za Nicholasa | [fantazija, animirani, obiteljski] | Film govori o Nicholasu, 10-godišnjem dječaku ... |
| 4 | Lumeni | [drama, komedija, romantika] | U život profesora udovca ulazi nova ljubav, al... |
