**Part Two: Analysis**


*   Identifying key words that tell what each article is about
*   Grouping similar articles together

**Imports**

In [486]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.decomposition import LatentDirichletAllocation
tfidf = TfidfVectorizer()

import nltk
import string
import pandas as pd
import numpy as np



In [487]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**Import articles**

In [488]:

df = pd.read_csv('apoliticalArticles.csv')

In [489]:
df = df.drop('Unnamed: 0', axis=1)

In [490]:
data = df['Content']
data.head()

0    Climate-related displacement and migration is ...
1    Imagine being asked to identify and cut £140 m...
2    Refugees in Ghana have the same legal rights a...
3    Luxembourg recently became the first country i...
4    This article is written by Juan Vila, director...
Name: Content, dtype: object

**Preprocessing**

In [491]:
def convert_lower_case(data):
    return np.char.lower(data)

In [492]:
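def remove_stop_words(data):
    # Build a set once per call so each membership test is fast
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(str(data))
    new_text = ""
    for w in words:
        # Drop stop words and single-character tokens
        if w not in stop_words and len(w) > 1:
            new_text = new_text + " " + w
    return new_text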


In [493]:
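def remove_punctuation(data):
    # Backslash is escaped explicitly to avoid an invalid "\]" escape sequence
    symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n"
    for symbol in symbols:
        # Replace each symbol with a space, then collapse double spaces
        data = np.char.replace(data, symbol, ' ')
        data = np.char.replace(data, "  ", " ")
    # Commas are removed outright rather than replaced with a space
    data = np.char.replace(data, ',', '')
    return data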

In [494]:
data=data.apply(lambda x: convert_lower_case(x))
data=data.apply(lambda x: remove_stop_words(x))
data=data.apply(lambda x: remove_punctuation(x))


In [495]:
data.head()

0     climate related displacement migration set gr...
1     imagine asked identify cut £140 million budge...
2     refugees ghana legal rights ordinary citizens...
3     luxembourg recently became first country worl...
4     article written juan vila director open gover...
Name: Content, dtype: object

**Calculate TF-IDF score of words in each article to determine words with high importance**

In [496]:
x = tfidf.fit_transform(data)

In [497]:
important_words = tfidf.vocabulary_

In [498]:
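# First 20 entries of the fitted vocabulary.
# Note: vocabulary_ maps each term to its column index in the TF-IDF matrix,
# so the numbers shown are column indices, not TF-IDF scores.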

important_words = list(important_words.items())[:20]

important_words


[('climate', 447),
 ('related', 2000),
 ('displacement', 737),
 ('migration', 1525),
 ('set', 2159),
 ('greatest', 1107),
 ('challenge', 407),
 ('era', 870),
 ('communities', 488),
 ('arid', 206),
 ('semi', 2140),
 ('lands', 1369),
 ('particularly', 1713),
 ('vulnerable', 2586),
 ('effects', 800),
 ('change', 410),
 ('exposure', 926),
 ('extreme', 930),
 ('temperature', 2385),
 ('irregularity', 1324)]
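
Because `vocabulary_` stores column indices, ranking words by importance needs the TF-IDF weights themselves. The block below is a minimal standard-library sketch of that ranking, assuming whitespace-tokenised text and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalisation, which to my understanding matches the `TfidfVectorizer` defaults. The function names `tfidf_scores` and `top_terms` and the toy documents are hypothetical and are not part of the notebook.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Scale each document vector to unit length
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        scores.append({term: w / norm for term, w in weights.items()})
    return scores


def top_terms(weights, k=3):
    """Terms with the k largest weights, highest first (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy documents invented for illustration only
toy_docs = [
    "climate migration climate communities",
    "refugees migration services",
    "budget council services",
]
toy_scores = tfidf_scores(toy_docs)
for i, weights in enumerate(toy_scores):
    print(i, top_terms(weights, k=2))
```

In the notebook itself, the same idea would apply to the rows of the matrix `x` returned by `tfidf.fit_transform(data)`: sort each row's weights and map the largest column indices back to terms.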

**Perform Topic Modelling on articles to identify the main theme of the articles**

In [499]:
no_features = 1000
# LDA is a probabilistic model of word counts, so it takes raw term counts rather than TF-IDF weights
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features,stop_words='english')
tf = tf_vectorizer.fit_transform(data)
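# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
tf_feature_names = tf_vectorizer.get_feature_names_out()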
no_topics = 14

# Run LDA (Latent Dirichlet allocation)
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=500, learning_method='online', learning_offset=50.,random_state=0).fit(tf)

In [500]:
# Function to display the topics and words
def display_topics(model, feature_names, no_top_words):
    words_list = []
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the no_top_words largest weights, highest first
        top_words = " ".join(feature_names[i]
                             for i in topic.argsort()[:-no_top_words - 1:-1])
        print("Topic", topic_idx)
        print(top_words)
        # Append topic to list to visualise in wordcloud
        words_list.append(top_words)
    return ' '.join(words_list)
no_top_words = 15
topic_words = display_topics(lda, tf_feature_names, no_top_words)

Topic 0
health refugees refugee services training status integration centres work new people migrant public states care
Topic 1
local regional government australia communities new job work need workforce areas migration different public leadership
Topic 2
developments decision said felt evidence council using solutions populations staff citizens day integration devices person
Topic 3
credit decision expectancy equal immediately team overhaul opportunities 10 unlike key relationship care circumstances require
Topic 4
council staff said deal training people change local town different values approach asked new taking
Topic 5
data digital public opening experience step limited working sessions team wanted spaces decisions right number
Topic 6
citizens source relationship developed seasonal today transportation identified solve county analysis servants face coordination felt
Topic 7
role social responsible remain commons million financial chronic unsplash changing regularly rapid practices
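
The slice `topic.argsort()[:-no_top_words - 1:-1]` used in `display_topics` can be hard to read. The sketch below reproduces the same selection in plain Python on made-up weights: sort the indices in ascending order of weight, then walk backwards from the end to take the `n` largest. The helper `top_indices` and the toy values are hypothetical and only illustrate the indexing.

```python
def top_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    # Ascending order of weight, which is what argsort returns
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Step backwards from the last element and stop after n items
    return order[:-n - 1:-1]


# Toy topic-word weights invented for illustration only
toy_feature_names = ["care", "health", "refugees", "services", "training"]
toy_topic = [0.2, 3.1, 2.4, 0.9, 1.5]

top_words = [toy_feature_names[i] for i in top_indices(toy_topic, 3)]
print(" ".join(top_words))  # health refugees training
```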

**Term-count vectors and Euclidean distance for article similarity calculation**

In [501]:
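content_list = list(data)
print(data)
# Bag-of-words term counts for each article, as a dense NumPy array
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(content_list).toarray()

# Distance from the article at index 1 to every article (including itself)
for f in features:
    print(euclidean_distances(features[1:2], f.reshape(1, -1)))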


0      climate related displacement migration set gr...
1      imagine asked identify cut £140 million budge...
2      refugees ghana legal rights ordinary citizens...
3      luxembourg recently became first country worl...
4      article written juan vila director open gover...
5      coronavirus crisis sweeping globe health data...
6      meanwhile refugees became settled turkey need...
7      discover inspiring tools resources policies c...
8      piece written nick kimber public policy profe...
9      piece written audrey macklin professor chair ...
10     article written madhu raghunath sector leader...
11     opinion piece written jack archer ceo regiona...
12     article written zachary spicer director resea...
13     article written mathew yarger head smart citi...
14     article written andrew phillips national mana...
Name: Content, dtype: object
[[57.39]]
[[0.]]
[[56.44]]
[[64.61]]
[[64.75]]
[[74.23]]
[[64.95]]
[[55.43]]
[[57.26]]
[[68.33]]
[[64.05]]
[[60.61]]
[[66.29]]
[[70

# Summary of findings

**Part one**

I started this section by attempting to scrape the provided URL with Python's Beautiful Soup library, but the page does not serve its HTML content directly on load. I resolved this with Python's Selenium library, which emulates the way a user opens the link in a browser, so I was able to get the HTML that Beautiful Soup needs to process. This approach should also scale to more URLs without rate-limit issues.

I exported the scraped data to a CSV file for further analysis in part two.

**Part Two: Analysis**

In this part, I analysed the data for important words and grouped similar articles together.

I performed data preprocessing and used TF-IDF (term frequency, inverse document frequency) to identify words of high importance in the articles. I identified words such as "climate", "migration" and "communities". These words reflect the central ideas of the articles analysed.

I used Euclidean distance to determine how similar the articles are to each other. The values in parentheses are distances measured from the article at index 1. Article 6 (74.23) and Article 7 (70.56) are similar: these two articles cover topics related to smart cities and future cities. Article 1 (57.39) and Article 3 (56.44) are also similar: both cover topics about refugees.
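
Distances from a single reference article only show how far each article is from that one article. Grouping usually compares every pair. The block below is a minimal standard-library sketch of that pairwise comparison, assuming whitespace-tokenised text and raw term counts. The function names `count_vectors`, `pairwise_distances` and `closest_pairs` are hypothetical, and the toy documents are invented for illustration rather than taken from the scraped data.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Turn each document into a term-count vector over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        vectors.append([counts[term] for term in vocab])
    return vectors


def pairwise_distances(vectors):
    """Square matrix of Euclidean distances between all vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]


def closest_pairs(distances):
    """All (distance, i, j) pairs with i < j, smallest distance first."""
    n = len(distances)
    pairs = [(distances[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    return sorted(pairs)


# Toy documents invented for illustration only
toy_docs = [
    "refugees legal rights citizens",
    "refugees settled rights services",
    "smart city data digital",
]
toy_distances = pairwise_distances(count_vectors(toy_docs))
for distance, i, j in closest_pairs(toy_distances):
    print(f"articles {i} and {j}: {distance:.2f}")
```

A smaller distance means the two articles share more of their vocabulary, so the first pairs in the sorted list are the strongest candidates for the same group.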