#  Homework 2

Deadline: March 10th 11:59pm

Hand in: the homework must be handed in through the Moodle system. 

<span style="color:red">Provide a written answer if requested in the exercise! These questions are marked in red.</span>

---

## Exercise 1 

APIs are a great way to collect data for your projects. Here are a few APIs you can try out:

- https://developer.nytimes.com/docs/archive-product/1/overview
- https://api.wikimedia.org/wiki/Getting_started_with_Wikimedia_APIs
- https://api.semanticscholar.org/api-docs/graph#tag/Paper-Data/operation/get_graph_get_paper_citations

In this exercise you'll collect and clean data from an API. This could be a good opportunity to collect data for your final project. (20%)

**a)** Collect data from an API of your choice. Process the returned data so that each unique data feature is a column in a pandas DataFrame. Investigate missing data, such as NaN values, and apply a solution. Cast each column to the appropriate type for the data it contains. Display the cleaned DataFrame using .head().

In [None]:
import requests
import json

In [None]:
api_key = 'lt06iVOvngWYAy29i79LsYG8fa99LrVn' 
url = 'https://api.nytimes.com/svc/archive/v1/2019/1.json?api-key=' + api_key

In [None]:
response = requests.get(url).json()['response']['docs']

In [None]:
with open("./Data/nytimes.json", "w") as file:
    json.dump(response, file, indent=4)


Now we can load the data into a DataFrame.

In [None]:
import pandas as pd

df2 = pd.DataFrame(response)
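
Nested fields such as the headline stay as dictionaries when the records are passed straight to `pd.DataFrame`. One possible way to get one column per feature is `pd.json_normalize`. The sketch below uses two made-up records that only loosely imitate the API's structure (the field values are assumptions, not real data):

```python
import pandas as pd

# Two hypothetical records, shaped loosely like entries of the API's "docs" list.
docs = [
    {"headline": {"main": "A", "print_headline": "A (print)"},
     "word_count": "120", "pub_date": "2019-01-01T05:00:00+0000"},
    {"headline": {"main": "B", "print_headline": None},
     "word_count": "87", "pub_date": "2019-01-02T10:30:00+0000"},
]

# Nested dictionaries become dotted column names such as "headline.main".
flat = pd.json_normalize(docs)

# Cast the text columns to types that match their content.
flat["word_count"] = flat["word_count"].astype(int)
flat["pub_date"] = pd.to_datetime(flat["pub_date"], utc=True)

print(flat.dtypes)
```

The dotted column names can be renamed afterwards if shorter names are preferred.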

We notice that the print_section and print_page columns have missing data.

In [None]:
print(len(df2))
print(len(df2['print_page'].dropna()))
print(len(df2['print_section'].dropna()))
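
Counting the missing values per column gives the same information in one step. The following is a minimal sketch on a made-up three-row frame (the values are assumptions), showing two possible treatments: a placeholder label for a categorical column and a nullable integer type for a numeric one.

```python
import pandas as pd

# Hypothetical sample with gaps similar to the ones observed above.
sample = pd.DataFrame({
    "print_page": ["1", None, "12"],
    "print_section": ["A", None, None],
    "word_count": [120, 87, 430],
})

# Number of missing entries in each column.
missing = sample.isna().sum()
print(missing)

# Option 1: replace missing categories with an explicit placeholder.
sample["print_section"] = sample["print_section"].fillna("unknown")

# Option 2: keep the gap, but store the column as a nullable integer.
sample["print_page"] = pd.to_numeric(sample["print_page"]).astype("Int64")
print(sample.dtypes)
```

Dropping the affected columns, as done in the next step, is a third option when they are not needed for the analysis.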


Keep only the important columns that have no missing data.

In [None]:
keys = ["id", "abstract", "section_name", "headline", "pub_date", "document_type", "word_count"]

In [None]:
df = pd.DataFrame(columns = keys)

In [None]:
from datetime import datetime

In [None]:
# Collect one dict per article, then build the DataFrame in a single call
# (calling pd.concat inside the loop copies the whole frame on every iteration).
rows = []
for idx, article in enumerate(response):  # idx avoids shadowing the built-in id()
    rows.append({
        "id": idx,
        "abstract": article["abstract"],
        "section_name": article["section_name"],
        "headline": article["headline"]["print_headline"],
        "pub_date": datetime.strptime(article["pub_date"], "%Y-%m-%dT%H:%M:%S%z"),
        "document_type": article["document_type"],
        "word_count": int(article["word_count"]),
    })
df = pd.DataFrame(rows, columns=keys)

In [None]:
print(df.head())

**b)** Use the dataframe created in part a) to answer an exploratory data analysis question of your choice. State your question, design a data visualization that answers your question and <span style="color:red">discuss</span>.

Question: how many articles were published in each section per week? Only the 10 largest sections of each week are shown.

In [None]:
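# Count articles per (calendar week, section), then keep the 10 largest sections in each week
weekly_counts = df.groupby([df['pub_date'].dt.strftime('%W'), 'section_name']).size().unstack()
top_sections = weekly_counts.apply(lambda week: week.nlargest(10), axis=1)
top_sections.plot(
    kind='bar',
    stacked=True,
    title='Number of articles published per week',
    xlabel='Calendar Week in 2019',
    ylabel='Number of articles',
    figsize=(12,8)
)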
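
A note on the week labels: the `%W` format code counts weeks starting on Monday and puts all days before the first Monday of the year into week `00`, which differs from ISO 8601 week numbering. The small sketch below (the helper name `week_labels` is made up for illustration) compares the two for the first days of January 2019:

```python
from datetime import date

def week_labels(d):
    """Return the strftime('%W') label and the ISO 8601 week number for a date."""
    return d.strftime("%W"), d.isocalendar()[1]

# Tuesday, Sunday and Monday at the start of 2019.
for d in (date(2019, 1, 1), date(2019, 1, 6), date(2019, 1, 7)):
    w_label, iso_week = week_labels(d)
    print(d.isoformat(), "%W =", w_label, "| ISO week =", iso_week)
```

If ISO weeks are preferred for the x-axis, the ISO week number could be used as the grouping key instead.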

## Exercise 2 

The [Round University Ranking (RUR)](https://roundranking.com/ranking/world-university-rankings.html#world-2021) evaluates the performance of 867 of the world’s leading higher education institutions by 20 indicators grouped into 4 key areas of university activity: Teaching, Research, International Diversity, and Financial Sustainability. The top 100 universities are placed in the diamond league, the next 100 in the gold league, and so on. (40%)

**a)**  Using the scraping techniques covered in class, scrape the following data fields about the universities (from the website linked above): the name of the university, the country in which it is located, and the score and league assigned by the RUR ranking. Then load the data into a pandas DataFrame called *df* with the following column names: <font style='font-style : oblique'>University</font>, <font style='font-style : oblique'>Country</font>, <font style='font-style : oblique'>Score</font> and <font style='font-style : oblique'>League</font>. 

IMPORTANT: You should not re-scrape the data every time you work on the homework, because we don't want the RUR servers to get overloaded. Instead, scrape the data once and save it to a local file on your computer (Hint: use the *DataFrame.to_csv()* method), then load the data from this file instead of re-scraping the website.
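
The scrape-once idea can be sketched as a small caching helper. This is only an illustration: `load_or_scrape` and `fake_scrape` are made-up names, and the stand-in scraper returns a hard-coded frame instead of contacting any website.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_or_scrape(cache_path, scrape_fn):
    """Return the cached CSV if it exists; otherwise call scrape_fn once and cache its result."""
    path = Path(cache_path)
    if path.exists():
        return pd.read_csv(path)
    df = scrape_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df

def fake_scrape():
    # Stand-in for the real scraping code (assumption: it returns a DataFrame).
    print("scraping ...")
    return pd.DataFrame({"University": ["Example University"], "Score": [90.5]})

# Demonstration in a temporary folder: only the first call triggers the scraper.
with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "Data" / "rankings.csv"
    first = load_or_scrape(cache_file, fake_scrape)
    second = load_or_scrape(cache_file, fake_scrape)
    print(second)
```

In the notebook, the real scraping cell would take the place of `fake_scrape`, and the cache path would point into the `./Data` folder.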

In [21]:
!pip install beautifulsoup4
!pip3 install selenium
!pip3 install webdriver-manager



In [22]:
import bs4
import selenium

In [23]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://roundranking.com/ranking/world-university-rankings.html#world-2021')

In [24]:
soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
table = soup.find_all('table')

Save the table as an HTML file so that we don't have to re-scrape the website.

In [25]:
with open('./Data/soup.html', 'w', encoding='utf-8') as file:
    file.write(table[0].prettify())

Load the table into a DataFrame and clean it up.

In [26]:
table = soup.find('table')
table_rows = table.find_all('tr')
data = []

for tr in table_rows:
    cells = tr.find_all('td')
    # Rows without <td> cells (such as a header row) yield an empty list
    row = [cell.text for cell in cells]
    data.append(row)


In [28]:
import pandas as pd

In [29]:
df = pd.DataFrame(data, columns=['Rank', 'University', 'Score', 'Country', 'Flag', 'League', 'Continent'])

In [30]:
df = df.drop(df.index[0])
df.reset_index(drop=True, inplace=True)
df.drop(columns=['Rank', 'Flag', 'Continent'], inplace=True)
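
Every value scraped from the HTML table is text at this point. The notebook relies on the CSV round trip to recover numeric scores; an explicit conversion is another option. A minimal sketch on made-up rows (the names and values are assumptions, not scraped data):

```python
import pandas as pd

# Hypothetical scraped rows: everything is a string, possibly with stray whitespace.
scraped = pd.DataFrame({
    "University": [" Example University A ", "Example University B"],
    "Score": ["96.742", "n/a"],
})

# Remove surrounding whitespace from the text column.
scraped["University"] = scraped["University"].str.strip()

# Convert scores to floats; entries that cannot be parsed become NaN.
scraped["Score"] = pd.to_numeric(scraped["Score"], errors="coerce")

print(scraped.dtypes)
print(scraped)
```

With `errors="coerce"`, unparseable entries show up as missing values and can be inspected before sorting by score.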

In [32]:
# to_csv accepts a path directly, so no explicit file handle is needed
df.to_csv('./Data/university_rankings.csv', index=False)


**b)** Filter the data as follows:
- Filter out the US universities. (The analysis aims to find out which universities rank high outside the USA to help US students in choosing a study abroad program.)
- Only keep the universities in the Diamond, Gold, Silver and Bronze league.
- Sort the dataframe by score. <span style="color:red">Which are the top 5 ranking universities?</span>

In [33]:
# Load the previously saved data instead of re-scraping
df = pd.read_csv('./Data/university_rankings.csv')

In [36]:
study_abroad_unis = df[
    (df["Country"] != "USA") & 
    df["League"].isin([
        "Diamond League",
        "Golden League", 
        "Silver League", 
        "Bronze League"
        ])
    ]
study_abroad_unis = study_abroad_unis.sort_values(by="Score", ascending=False)

In [38]:
print(study_abroad_unis.head(5))

                                          University   Score      Country  \
3                            Imperial College London  96.742           UK   
4                               Karolinska Institute  96.609       Sweden   
6                               University of Oxford  96.167           UK   
7  ETH Zurich (Swiss Federal Institute of Technol...  94.629  Switzerland   
8                            University of Cambridge  94.383           UK   

           League  
3  Diamond League  
4  Diamond League  
6  Diamond League  
7  Diamond League  
8  Diamond League  


The top 5 ranking universities outside the US are the following:

- Imperial College London
- Karolinska Institute
- University of Oxford
- ETH Zurich
- University of Cambridge

**c)** Create a word cloud from the Mission Statements of the top Universities. We have already scraped these statements for you. You can find the scraped data [here](https://math.bme.hu/~pinterj/BevAdat1/Adatok/wordcloud.txt)! <br>
- Load the text data from this site into a string variable! (Hint: You can load the data with *urlopen* as shown in Notebook2)
- Omit the word university from the data!
- Create a word cloud, then <span style="color:red">describe what you see in 2-3 sentences!</span>

(Hint: You can find more information on how to create a Word Cloud at https://www.datacamp.com/community/tutorials/wordcloud-python)

In [None]:
!pip install WordCloud

In [51]:
from wordcloud import WordCloud
from urllib.request import urlopen
import matplotlib.pyplot as plt

In [49]:
url = "https://math.bme.hu/~pinterj/BevAdat1/Adatok/wordcloud.txt"
source = urlopen(url).read().decode('utf-8')

In [50]:
source = source.replace("university", "")
source = source.replace("University", "")
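
The two replacements above only cover two spellings and also cut the word out of longer words. A regular expression is one possible alternative. In the sketch below, `drop_word_forms` is a made-up helper and the sample sentence is invented; the default pattern also removes the plural form, which goes slightly beyond what the exercise asks for.

```python
import re

def drop_word_forms(text, pattern=r"\buniversit(?:y|ies)\b"):
    """Remove every case-insensitive match of pattern, then collapse the leftover whitespace."""
    cleaned = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", cleaned).strip()

sample = "The University serves society. UNIVERSITY research and universities worldwide."
print(drop_word_forms(sample))
```

The cleaned string can then be passed to `WordCloud().generate(...)` exactly as before.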

In [None]:
wordcloud = WordCloud().generate(source)

In [None]:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

---

## Exercise 3 

Let's improve on the model used in class by fitting a decision tree to the bank dataset. (40%) 

**a)** In class we solved a classification problem on the *bank.csv* dataset using the kNN algorithm. The classifier performed poorly. Repeat the analysis carried out in class (based on Notebook02), but now use a decision tree with the maximum depth set to 6! **Hint:** Use the *tree.DecisionTreeClassifier* classifier!


**b)** Now fit the tree using different parameters! Plot the ROC curve of the decision tree obtained in part a) and the new tree in the same figure (with different colors). Also plot the *y=x* diagonal line!
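
Parts a) and b) could be approached along the following lines. This is only a sketch under stated assumptions: *bank.csv* is not loaded here, so a synthetic dataset from `make_classification` stands in for it, and the second tree's parameters are arbitrary choices for illustration. The helper `plot_roc` is a made-up name.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for bank.csv (assumption: the real features and target differ).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "max_depth=6": DecisionTreeClassifier(max_depth=6, random_state=0),
    "max_depth=3, min_samples_leaf=20": DecisionTreeClassifier(
        max_depth=3, min_samples_leaf=20, random_state=0
    ),
}

# Fit each tree and store its ROC points and AUC on the held-out data.
curves = {}
for label, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, scores)
    curves[label] = (fpr, tpr, roc_auc_score(y_test, scores))
    print(label, "AUC:", round(curves[label][2], 3))

def plot_roc(curves):
    """Draw all ROC curves in one figure, together with the y = x diagonal."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 6))
    for label, (fpr, tpr, auc) in curves.items():
        ax.plot(fpr, tpr, label=f"{label} (AUC = {auc:.3f})")
    ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="y = x")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend()
    return ax
```

Calling `plot_roc(curves)` in a notebook cell draws both curves in different colors; a curve further above the diagonal indicates a better ranking of the positive class.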

**c)** Plot and interpret the decision tree. The easiest way to do this is with the sklearn.tree.plot_tree function. Here's a useful article: https://pythoninoffice.com/how-to-a-plot-decision-tree-in-python/.

* If you would like, you can also try plotting the decision tree using the graphviz package. **Hints:** Visualize the decision tree trained in part a) using the *tree.export_graphviz* function. To present the tree, use the *graphviz.Source* function or the *SVG* function of the *IPython.display* package! If *graphviz* is not installed, you can install it using the Anaconda Navigator, with *pip install*, or with Homebrew (*brew install graphviz*). If it still doesn't work, you can download it from this [link](https://graphviz.gitlab.io/download/) and insert the following lines of code (use the correct path for your downloaded file):<br><br>
import os <br>
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin' <br><br>

* <span style="color:red">Briefly interpret the results! According to what attribute did we split the tree first? Which were the usual splitting attributes? </span>
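
Besides the plot, the questions about the first split and the usual splitting attributes can be checked directly on the fitted tree. The sketch below again uses synthetic data and placeholder feature names (both assumptions); with the real data, the column names of the bank dataset would be used instead.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=1, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Node 0 is the root; tree_.feature holds the feature index used at each node
# (negative values mark leaves).
root_feature = feature_names[clf.tree_.feature[0]]
split_counts = Counter(feature_names[i] for i in clf.tree_.feature if i >= 0)

print("First split on:", root_feature)
print("Splits per feature:", split_counts.most_common())

# Plain-text rendering of the tree, as an alternative to plot_tree.
text = export_text(clf, feature_names=feature_names)
print(text)
```

Features that appear in many internal nodes are the "usual" splitting attributes; the root feature is the one chosen for the first split.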

**d)** Plot the feature importances for the decision tree. The link in part c) also has useful information about this. <span style="color:red"> Briefly interpret. </span>
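
For part d), the fitted classifier exposes `feature_importances_`, which can be wrapped in a pandas Series for sorting and plotting. As before, the data and feature names below are synthetic placeholders, not the bank dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=1, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(clf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)

# In a notebook, a bar chart can be drawn with:
# importances.plot(kind="barh", title="Feature importances")
```

The importances sum to 1, so each value can be read as the share of the total impurity reduction attributed to that feature.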

## Works Cited:

Please cite all external resources you used to complete this assignment. If you used ChatGPT, please include a link to the conversation.

1.a [adding new rows to df](https://stackoverflow.com/questions/75956209/error-dataframe-object-has-no-attribute-append)

1.b [grouping dates by week](https://stackoverflow.com/questions/45281297/group-by-week-in-pandas)

[pandas cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf)

2.b [filtering columns by multiple conditions](https://chat.openai.com/share/ad7425b9-8f06-4c33-9922-02739ea46cd8)