# Fundamentals of Data Science - Week 3 and Week 4 

###  <span style='color: green'>See Canvas or scroll down to the bottom of the notebook for assignment details</span>
<p></p>
<span style='color: red'>Deadline: 25/09/20, 23:59:59 </span>


In this notebook we are going to cover the following practical aspects of data science:

+ Gathering data (scraping the Twitter Streaming API)
+ Storing and organizing it (store to file or a database)
+ Preprocess the data
+ Perform sentiment, topical and correlation analysis
+ Visualize

To complete this assignment you need to have a running Anaconda installation with Python 3.8 on your device. Python package prerequisites include:

+  **Twitter API Client** [Tweepy](https://github.com/tweepy/tweepy) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [Install command: **pip install tweepy**]
+  **Python Data Analysis Library** [Pandas](https://pandas.pydata.org/pandas-docs/stable/install.html)  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  [Install command: **pip install pandas**]
+  **Python Visualization Library** [MatPlotLib](https://matplotlib.org/)   &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; [Install command: **python -m pip install matplotlib**]
+  **Python Topic Modelling Library** [GENSIM](https://radimrehurek.com/gensim/install.html) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  [Install command: **pip install --upgrade gensim**]

An additional requirement if **you would like to use a database** is MongoDB (Community Server):
+ MongoDB database server instance [MongoDB Installation Instructions](https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/#install-mongodb-community-edition)
+ **Windows download** (and perhaps Linux, untested): [link](https://www.mongodb.com/download-center?jmp=nav#community)
+  **Python-Mongo Database Client** [PyMongo](https://api.mongodb.com/python/current/)  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  [Install command: **python -m pip install pymongo**]

## Gathering Data - Twitter API

The public Twitter API consists of a REST API and a Streaming API. Most application developers mix and match the APIs to produce their application. The Streaming API provides low-latency high-volume access to Tweets. Additionally, there are some families of APIs (such as the Ads API) which require your application to be whitelisted in order to make use of them. For this assignment we are going to use the [**Twitter Streaming API**](https://dev.twitter.com/streaming/overview).

### Twitter Streaming API

The Streaming APIs give developers low-latency access to Twitter’s global stream of Tweet data. A streaming client will be pushed messages indicating Tweets and other events have occurred, without any of the overhead associated with polling a REST endpoint.

Twitter offers several basic streaming endpoints, each customized to certain use cases:
+ **Public Streams** - Streams of the public data flowing through Twitter. Suitable for following specific users or topics, and data mining.
+ **User Streams** &nbsp;&nbsp;&nbsp;- Single-user streams, containing roughly all of the data corresponding with a single user’s view of Twitter.
+ **Site Streams** &nbsp;&nbsp;&nbsp;&nbsp;- The multi-user version of user streams. Site streams are intended for servers which must connect to Twitter on behalf of many users. Site Streams is a closed beta. Applications are no longer being accepted.

In this assignment we are going to use the **Twitter Public Streams** to gather data about certain topics of interest. To use the Twitter API we need to create a Twitter account, create a Twitter app, and obtain the API keys.

### Obtaining Twitter API Keys

To access the Twitter Streaming API, we need four pieces of information from Twitter: API key, API secret, Access token and Access token secret. Follow the steps below to get all four:

+ Create a Twitter account if you do not already have one.
+ Go to https://apps.twitter.com/ and log in with your Twitter credentials.
+ Click "Create New App".
+ Fill out the form, agree to the terms, and click "Create your Twitter application".
+ On the next page, click the "API keys" tab, and copy your "API key" and "API secret".
+ Scroll down and click "Create my access token", and copy your "Access token" and "Access token secret".

### Connecting to Twitter Streaming API and downloading data

Now that we have the necessary credentials we can use the Tweepy library we installed in the previous step to connect to Twitter and start gathering data.

First we import the required classes from the Tweepy library:


In [1]:
# Import the necessary classes from the tweepy library

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream


Next we copy the credentials into separate variables that we are going to use throughout the assignment.

**NOTE** It is a general best practice not to keep sensitive data like API keys in raw form in your scripts. For simplicity and demonstration purposes we can do that in this exercise, but it is not acceptable in a real-life scenario.

In [2]:
consumer_key = 'TmjrH4cD4cG7RfoiN42wpGAwg'
consumer_secret = 'tfoOB4P3yx2XjCGXgzeOa5eQ70WBmEHTNA7Gld8RcwPECSaiHK'
access_token = '597732281-eYkrFH4gIQ67i9l2GzifdvKtc6Zxfyp7cgRxRPEr'
access_token_secret = 'n367uA0lexHNcfpBBmrTbNycYqcwjdnilDWdHLCJMUzl3'
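
As the note above says, hard-coded keys are only acceptable for a demonstration. One common alternative is to read them from environment variables. The sketch below is a minimal illustration and not part of the original assignment; the environment variable names and the helper `load_credentials` are assumptions chosen for this example.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}
```

With the variables set in your shell before starting Jupyter, `creds = load_credentials()` followed by `consumer_key = creds["TWITTER_CONSUMER_KEY"]` (and likewise for the other three) would stand in for the literal strings above.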

Next we specify:
+ The location where we are going to dump the tweets that we obtained through the Streaming API
+ A basic function that formats and stores the tweets in a text file for later usage
+ A class consisting of a listener that attaches to a particular stream and displays the tweets directly onto the screen.

In [3]:
# We need to import json for dumping the tweets into our file
import json 

# Here we specify where the tweets would be stored
tweets_collection = 'tweets.txt'
tweet_file = open(tweets_collection, 'a')

# A basic Python function that appends a tweet to a text file, one per line
def dump_tweet_to_json(tweet, dump_file):
    dump_file.write(tweet + '\n')

# A basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    # When we get data through the API, print it on screen
    def on_data(self, data):
        print(data)
        return True

    # When an error occurs, print the status code so that we know what it is.
    def on_error(self, status):
        print(status)

Using the defined class we can authenticate as in the example below and attach our listener to a stream that is filtered on these topics:
+ Data Science
+ University Of Amsterdam
+ Python
+ Artificial Intelligence

The code section below **will not stop automatically** once you run it (which is the whole point of the streaming API). To stop the execution and move on to the next section, interrupt the kernel using the **stop symbol** in the top toolbar.

#### Warning, you may have to restart the kernel to halt the stream depending on the Anaconda/tweepy version 

In [None]:
if __name__ == '__main__':

    # This handles Twitter authentication and the connection to the Twitter Streaming API

    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    stream.filter(track=['data science', 'university of amsterdam', 'python', 'artificial intelligence'])
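
Since the stream does not stop by itself, it can be convenient to let the listener end it after a fixed number of tweets. The sketch below is a hypothetical helper, not part of the original notebook. It assumes the Tweepy 3.x convention that the stream disconnects when `on_data` returns `False`; check this against the Tweepy version you have installed.

```python
class TweetLimit:
    """Counts received tweets and signals when a maximum has been reached."""

    def __init__(self, max_tweets):
        self.max_tweets = max_tweets
        self.seen = 0

    def keep_going(self):
        """Register one tweet; return False once the limit has been reached."""
        self.seen += 1
        return self.seen < self.max_tweets
```

A listener could create `self.limit = TweetLimit(100)` in its constructor and end `on_data` with `return self.limit.keep_going()` instead of `return True`.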

If we now modify our listener class to store the tweets in a file instead of printing them on screen, we can use the tweets in our later analysis. With the `tweets_collection` path defined above, the file is created in the notebook's working directory. If you point the path at a 'data' folder instead, create that folder manually first, because opening a file does not create missing folders.

In [4]:
# A listener that stores received tweets in a file instead of printing them.
class StdOutListener(StreamListener):

    # When we get data through the API, append it to the tweets file
    def on_data(self, data):
        print("Stored a new tweet.")
        dump_tweet_to_json(data, tweet_file)
        return True

    # When an error occurs, print the status code so that we know what it is.
    def on_error(self, status):
        print("Error: ", status)
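
A note on the file format: if the raw payload passed to `on_data` already ends with a line break, appending another `'\n'` leaves a blank line after every tweet. The `Expecting value` messages that appear when the file is read back later in this notebook would be consistent with that. The sketch below shows one possible way to keep exactly one JSON record per line and to skip blank lines when reading. Both helper names are hypothetical and not part of the original notebook.

```python
import json


def append_json_line(raw_tweet, dump_file):
    """Write one tweet per line, dropping any trailing line break first."""
    dump_file.write(raw_tweet.strip() + "\n")


def read_json_lines(lines):
    """Parse an iterable of lines as JSON records, skipping blank lines."""
    tweets = []
    for line in lines:
        if not line.strip():
            continue  # blank separator line, nothing to parse
        tweets.append(json.loads(line))
    return tweets
```

`append_json_line` could stand in for `dump_tweet_to_json`, and `read_json_lines(open(tweets_collection))` could replace a manual read loop.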

If we run the main section again for a short period of time and then check the tweets.txt file, we will find the streamed tweets.

#### Warning, you may have to restart the kernel to halt the stream depending on the Anaconda/tweepy version 

In [None]:
if __name__ == '__main__':

    # This handles Twitter authentication and the connection to the Twitter Streaming API

    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    stream.filter(track=['data science', 'amsterdam', 'python', 'artificial intelligence', 'netherlands'])

### Optional MongoDB check


If MongoDB is installed and running on your device, the tweets can be stored as database entries. The following code checks the connection by inserting a sample document into a collection named `twitter` in the `test` database:

**Note**: If MongoDB is not installed or not running on your device, the code raises a connection-refused error (on Windows the message may say 'actively refused'). In that case you need to install the MongoDB Community Server.

In [None]:
from pymongo import MongoClient

client = MongoClient()
db = client.test

db.twitter.insert_one({'sample':'tweet'})

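An `insert_one()` call like the one on the final line would replace the **dump_tweet_to_json()** function call in the listener class. Because `insert_one()` expects a dictionary, the raw tweet string has to be parsed first.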
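
As a rough sketch of that replacement, the helper below parses the raw tweet and hands the resulting dictionary to `insert_one()`. The function name `store_tweet` is hypothetical; the `collection` argument is assumed to be a PyMongo collection such as `db.twitter`, although any object with an `insert_one` method works.

```python
import json


def store_tweet(raw_tweet, collection):
    """Parse a raw JSON tweet and insert it into a MongoDB-style collection."""
    document = json.loads(raw_tweet)
    collection.insert_one(document)
    return document
```

Inside the listener, `store_tweet(data, db.twitter)` would then take the place of `dump_tweet_to_json(data, tweet_file)`.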

## Preprocessing the data

Assuming that our stream listener has been running for a while and we have gathered some tweets, our tweets.txt file has grown to contain quite a few tweets now. If we open the file and read it line by line, we can import the tweets as json objects in a list and see their contents:

In [5]:
# pprint is 'pretty print', a print function that gives 'nicer' output than print
from pprint import pprint

tweets_data = []

# Read the file line by line; lines that are not valid JSON are reported and skipped
with open(tweets_collection, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweet = json.loads(line)
            tweets_data.append(tweet)
            print('Imported tweet created at:', tweet['created_at'])
            print('Tweet content: \n', tweet['text'], '\n')
        except Exception as e:
            print(e)
            continue

print('#############################################')
print('We have gathered:', len(tweets_data), 'tweets.')
print('#############################################')

print("Information contained in a single tweet: \n")
pprint(tweets_data[0].keys())


Imported tweet created at: Thu Sep 17 12:33:03 +0000 2020
Tweet content: 
 #WO2 (20-05-1940 (Opname)) Het koffiehuis, Amsterdam. https://t.co/DJwmf1lAeU NIOD: https://t.co/yq1Gm4VfVt 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:33:06 +0000 2020
Tweet content: 
 RT @cyberpredator01: Click on below link to read...

https://t.co/R8733RsYt8

#computer #technology #pc #tech #gaming #laptop #computerscie…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:33:06 +0000 2020
Tweet content: 
 RT @cyberpredator01: Click on below link to read...

https://t.co/R8733RsYt8

#computer #technology #pc #tech #gaming #laptop #computerscie…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:33:08 +0000 2020
Tweet content: 
 RT @TruthSeeker____: Footage Of Biden Touching Girls On C-Span Flagged As Child Sexual Exploitation By By Twitter's Artificial Intelligence…

Expecting value: line 2 colum

Imported tweet created at: Thu Sep 17 12:34:53 +0000 2020
Tweet content: 
 RT @btsdutchstats: Spotify Netherlands 🇳🇱

#29 (+3) *new peak* ‘Dynamite’ 95,160 🔼

Total: 1,904,011

@BTS_twt https://t.co/Ab4FZvmbpj 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:34:54 +0000 2020
Tweet content: 
 RT @gp_pulipaka: The Essential Amazon #AWS Books for Cloud Professionals. #BigData #Analytics #DataScience #IoT #IIoT #Python #RStats #Tens…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:34:54 +0000 2020
Tweet content: 
 Click on below link to read...

https://t.co/R8733RsYt8

#computer #technology #pc #tech #gaming #laptop… https://t.co/PqrEEgCffV

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:34:55 +0000 2020
Tweet content: 
 @banaanpankoek @turtlejoestin Jaa preciess self made brands! Ik ken en paar van jonge ondernemers in amsterdam en z… https://t.co/wRvLaqokGo


Imported tweet created at: Thu Sep 17 12:37:04 +0000 2020
Tweet content: 
 14:37:04 P 1 Ongeval Wegvervoer Letsel Verlengde Stellingweg Amsterdam 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:37:04 +0000 2020
Tweet content: 
 14:37:04 P1 ongeval wegvervoer letsel verlengde stellingweg amsterdam 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:37:04 +0000 2020
Tweet content: 
 P 1 Ongeval Wegvervoer Letsel Verlengde Stellingweg Amsterdam https://t.co/kpYijsiGxn #p2000 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:37:05 +0000 2020
Tweet content: 
 🔴 #AMSTERDAM #AML 🚓 #p2000
P 1 Ongeval Wegvervoer Letsel Verlengde Stellingweg Amsterdam
 https://t.co/wqCbV9KNvt 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:37:05 +0000 2020
Tweet content: 
 P 1 Ongeval Wegvervoer Letsel Verlengde Stellingweg Amsterdam https://t.co/GvE1u77gkh #p2000 


Imported tweet created at: Thu Sep 17 12:39:01 +0000 2020
Tweet content: 
 RT @realtoughcandy: My 5th book Portfolio Surgery is dropping November 16! 

It shows you 5 methods for making #software projects your own…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:39:01 +0000 2020
Tweet content: 
 RT @couponed_code: Ensemble Machine Learning in Python : Adaboost, XGBoost 

https://t.co/SszTchJpyi

#MachineLearning #MachineTools #DeepL…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:39:02 +0000 2020
Tweet content: 
 RT @TrainingSadi: This morning's Sales Team Pre-call planning for the upcoming virtual conference on: 𝗣𝗿𝗼𝗴𝗿𝗮𝗺𝗺𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗣𝘆𝘁𝗵𝗼𝗻 𝗳𝗼𝗿 𝗘𝘅𝗲𝗰𝘂𝘁𝗶𝘃𝗲…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:39:03 +0000 2020
Tweet content: 
 RT @PomandaCo: The Netherlands: 🇳🇱

 RT @glitchbotio: "Repetition does not transform a lie into a truth."- Franklin D. Roosevelt #softwareengineer #javascript #python #C++ #gol…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:40:58 +0000 2020
Tweet content: 
 RT @Datascience__: The Impact of Artificial Intelligence on Workspaces https://t.co/wessiKcjrJ  #ArtificialIntelligence 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:40:59 +0000 2020
Tweet content: 
 RT @Jude_Pullen: RadioGlobe goes LIVE!! 
Not only a cool project to discover music, news and ideas - but arguably a 101 in Design, Tech, 3D…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:40:59 +0000 2020
Tweet content: 
 RT @UberFacts: In 2013, 19 prisons in the Netherlands closed because the country didn't have enough criminals to fill them. 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:41:00 +0000 2020
Tweet cont

Imported tweet created at: Thu Sep 17 12:43:24 +0000 2020
Tweet content: 
 How “green” is your Artificial Intelligence? - Artificial intelligence (AI) s https://t.co/haYqyQU5eW #ai #intoAInews

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:43:25 +0000 2020
Tweet content: 
 RT @ikeoluwabamise: September Writing challenge
Day 15 : If you could run away, where would you go?

Europe ( Iceland, Netherlands, Finland…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:43:26 +0000 2020
Tweet content: 
 RT @cppsecrets: C++ std::for_each with std::vector
https://t.co/34mKKM9yVM

#Cppsecrets #cppsecrets #python3 #python #cplusplus #programmin…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:43:27 +0000 2020
Tweet content: 
 Netizens po ba ang tawag sa mga taga Netherlands? 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:43:27 +0000 2020
Tweet co

Imported tweet created at: Thu Sep 17 12:45:03 +0000 2020
Tweet content: 
 RT @Bachvereniging: Onder toeziend oog van Rembrandts 'De Nachtwacht' speelt Richard Egarr het Concert in D groot, BWV 972. Uitgevoerd op e…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:45:04 +0000 2020
Tweet content: 
 RT @PrimeClasses_: Discriminative Model - Machine Learning Glossary ?

#primeclasses #datascience #Python #libraries #machinelearning #arti…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:45:04 +0000 2020
Tweet content: 
 Dark side of #AI: How to make artificial intelligence trustworthy. https://t.co/4JB4okfnqz? @avivahl @Gartner_inc… https://t.co/Gasvfzjuq7

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:45:04 +0000 2020
Tweet content: 
 Something hilarious about waiting for the the 25th anniversary (according to Wikipedia) of your library before writ… https://t.co/g8ITZil5z

Imported tweet created at: Thu Sep 17 12:46:40 +0000 2020
Tweet content: 
 RT @Bachvereniging: Onder toeziend oog van Rembrandts 'De Nachtwacht' speelt Richard Egarr het Concert in D groot, BWV 972. Uitgevoerd op e…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:46:40 +0000 2020
Tweet content: 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:46:40 +0000 2020
Tweet content: 
 RT @ProgrammingHero: [Guidelines] Programming for Beginners!
.
.
.
.
.
https://t.co/uBvXSlkOmU

#BigData #Analytics #DataScience #DeepLearn…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:46:41 +0000 2020
Tweet content: 
 RT @ianmSC: Nashville’s case numbers in bars and restaurants were so low they literally were emailing each other about how best to hide tha…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:46:41 +0000 2020
Tweet content: 
 RT @hype4com: If you

Imported tweet created at: Thu Sep 17 12:48:33 +0000 2020
Tweet content: 
 Jagmeet isn’t smart - have to ignore  him, he just rants about things without proper understanding of facts and dat… https://t.co/c1TYYV0T4K

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:48:33 +0000 2020
Tweet content: 
 C++ program : Given level order traversal of a binary tree, check if the tree is a min-heap
https://t.co/Czgo1eSXI6… https://t.co/4aFvCxKrJ2

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:48:34 +0000 2020
Tweet content: 
 RT @sh6f: „ÅØ„ÅÑ„ÄÅC++„Åã„ÇâJava„Å´„ÄÇ„Åù„Åó„Å¶ÊúÄËøë„ÅØPython„Å∏„ÄÇ
https://t.co/YPYu1gpgMH 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:48:36 +0000 2020
Tweet content: 
 RT @hope_ssr: NCB got leads that Prime accused Rhea used to order drugs from outside India, Especially from Netherlands #ArrestSSRKillersNow 

Expecting value: line 2 column 1 (char 1)
Im

Imported tweet created at: Thu Sep 17 12:50:24 +0000 2020
Tweet content: 
 RT @DrBotsvadze: AI 100: The Artificial Intelligence Startups Redefining Industries @CBinsights #healthcare #healthtech #fintech #transport…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:50:24 +0000 2020
Tweet content: 
 RT @ianmSC: Nashville’s case numbers in bars and restaurants were so low they literally were emailing each other about how best to hide tha…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:50:27 +0000 2020
Tweet content: 
 RT @036moji: day47 45m #100DaysOfCode IP15m
#Python #paiza

✅問題集 Dランク早解きセット 20/20

お疲れさまでした
おやすみなさい

#プログラミング学習 #今日の積み上げ

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:50:28 +0000 2020
Tweet content: 
 RT @lukejostins: We are recruiting a new data science PI to our institu

Imported tweet created at: Thu Sep 17 12:52:05 +0000 2020
Tweet content: 
 New #job: Release Manager (Salesforce Veeva CRM) Location: Netherlands Salary: 50ph - 60ph ..… https://t.co/S8zzAQCHnB

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:52:07 +0000 2020
Tweet content: 
 RT @alferdi: #BoneiruTinTwitter any Bonairians living in the Netherlands who feel like taking a FREE day trip on Sunday Sept 27 to the Bonn…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:52:07 +0000 2020
Tweet content: 
 RT @een_keuze: "Amsterdam Heeft Een Keuze" is een Volksinitiatief om het massatoerisme te beperken. Wij hebben het College van B&amp;W van Amst…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:52:07 +0000 2020
Tweet content: 
 RT @archangeljf12: Artificial intelligence program used against ISIS now used by DNC against Trump &amp; supporters online 

Expecting value: line 2 column 1

 RT @patri_vaquero_: 🔬 A Comparison of DNN, CNN and LSTM using TF/Keras. #BigData #Analytics #DataScience #AI #MachineLearning #IoT #Python…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:54:03 +0000 2020
Tweet content: 
 Learn some #coding. #webdev #python #pythonprogramming  かるい、軽い – light, easy (karui)  Q: Starting your entrepreneur… https://t.co/Rq8dd09Jin

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:54:03 +0000 2020
Tweet content: 
 RT @mygovindia: Are you an AI enthusiast and you think your innovative AI solution has the power to solve societal challenges? Then we have…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:54:04 +0000 2020
Tweet content: 
 RT @Towno10: Sacked at Inter ✅
Sacked at Palace ✅
Mutually agreed exit at Atlanta 👀 ✅
Next HC of the Netherlands 🤷🏽‍♂️

Expecting value: line 2 column 1 (char 1)
Imported twee

Imported tweet created at: Thu Sep 17 12:56:03 +0000 2020
Tweet content: 
 RT @lalleal: Brilliant presentation, @pndrej! The whole data engineering &amp; data science fields are too naive wrt quality to be trusted with…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:56:04 +0000 2020
Tweet content: 
 Old steam ships in the harbor at Dordrecht the Netherlands  https://t.co/SFGR5u4G0b 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:56:05 +0000 2020
Tweet content: 
 Implement variable game speed https://t.co/CjKdlKLMCh #github #C++ #C #CMake #Shell #Python #Objective-C #Makefile #Dockerfile #Ruby 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:56:06 +0000 2020
Tweet content: 
 RT @ZackTM: Meanwhile in North Carolina, we submitted a FOIA request asking NCDHHS for the “science &amp; data” showing why it was safe to open…

Expecting value: line 2 column 1 (char 1)
Imported 

Imported tweet created at: Thu Sep 17 12:57:58 +0000 2020
Tweet content: 
 @mahi0x00  ANNNA JARA E... C ND PYTHON FREE GA  NERCHUKOVALI 
EDAINA SOURCE CHEPPU ANNA 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:57:59 +0000 2020
Tweet content: 
 C++ program to find subtree with given sum in a binary tree
https://t.co/Xy5l901YA8

#Cppsecrets #cppsecrets… https://t.co/K5fUlQMgtP

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:58:00 +0000 2020
Tweet content: 
 Morgen opent 'Afropean' van Johny Pitts bij De Balie in Amsterdam https://t.co/TgSWwYECXr @DeBalie 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:58:01 +0000 2020
Tweet content: 
 RT @FaisalJavedKhan: Prime Minister Imran Khan will inaugurate the KP Govt flagship project  Pak-Austria Fachhochschule University in Harip…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 12:58:01 +0000 2020


Imported tweet created at: Thu Sep 17 13:00:01 +0000 2020
Tweet content: 
 @MaucherJenkins have Trainee Patent Attorney/Technical Assistant positions available for candidates with a backgrou… https://t.co/ulrV88jN32

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:00:02 +0000 2020
Tweet content: 
 #Alojamiento Web para #Python, #RubyOnRails, #Php, #MySql, #MySqli, 99.99% Uptime, #cPanel, #Backups, Migración,… https://t.co/tIIEfzfQmh

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:00:02 +0000 2020
Tweet content: 
 Contract Administrator looking for new opportunities around the Amsterdam area. https://t.co/6tTmsmdDaJ 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:00:02 +0000 2020
Tweet content: 
 RT @garydavis0308: Python Development Cost | Cost to Build a Web App with Python

Read: https://t.co/HUi01R309m 

#Python #WebApp #Cost htt…

Expecting value: line 2 column 1 (c

Imported tweet created at: Thu Sep 17 13:01:39 +0000 2020
Tweet content: 
 ✅Online Data Science Training 
Master data science with a structured all-in-one training at 50% OFF and earn a Veri… https://t.co/EBBMO4nISU

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:01:40 +0000 2020
Tweet content: 
 RT @LaForge_AI: Data Science 2020: Data Science &amp; Machine Learning in Python

#DeepLearning #Analytics #learning #DataScience
https://t.co/…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:01:40 +0000 2020
Tweet content: 
 Do how do we get the science and data listened to?? 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:01:40 +0000 2020
Tweet content: 
 #blockchain Oak Grove Recruiting is hiring for the following position: Data science product manager. Link: https://t.co/nvoqeRPtVR 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:01:41 

Imported tweet created at: Thu Sep 17 13:04:01 +0000 2020
Tweet content: 
 vuurvliegje18: vuurvliegje18 (19 jaar) Woonplaats: Amsterdam Land: Nederland Geaardheid: bisexueel Kleur haar: brui… https://t.co/WE9VJaHEkV

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:04:02 +0000 2020
Tweet content: 
 RT @katka_nedbal: 🌍⛈️🤖🧠
#AI &amp; #MachineLearning at @ECMWF
📑 Summary
https://t.co/7Q9gHlNcav
📆 #StayTuned 5 - 8 Oct
 https://t.co/sztC48VVt…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:04:03 +0000 2020
Tweet content: 
 the highlight of this data science class is the teacher recognizing my last name from my cousin who got manhandled by an elephant on cnn 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:04:04 +0000 2020
Tweet content: 
 RT @roguelynn: Out of curiosity: what Python libraries (stdlib or not) are your favorite to use? Which provide a solid or

Imported tweet created at: Thu Sep 17 13:05:56 +0000 2020
Tweet content: 
 RT @DevelopmentPk: PM Imran Khan inaugurated another flagship project of #KP Government, Pak-Austria Fachhochschule University, #Haripur. U…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:05:58 +0000 2020
Tweet content: 
 Much of the data I needed for this analysis is freely available on NOAA, PANGAEA etc, but I also found lots of data… https://t.co/6VTxc69PMa

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:06:01 +0000 2020
Tweet content: 
 RT @HollandCT_PT: De meest multi-functionele eventzaal van Amsterdam. Hoeveel 1,5m afstand mogelijkheden tel jij? 😉 Wouter laat ons slechts…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:06:03 +0000 2020
Tweet content: 
 RT @glitchbotio: "Ideas do not reach perfection in a day, no matter how much study is put upon them."- Alexander Graham Bell #softwareen


Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:08:04 +0000 2020
Tweet content: 
 These artificial intelligence generated dogs will make you laugh | @FeedBox https://t.co/Zubp8fRhi4 #AI #Robots #dogs 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:08:04 +0000 2020
Tweet content: 
 .@VSinkevicius STOP IMPUNITY! Don't let #ElectricFishing lobbies dictate EU policy. By closing @Bloom_FR complaints… https://t.co/tDSag6zeEA

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:08:04 +0000 2020
Tweet content: 
 RT @cOqzhu3MjL1JVmm: @AnnieFDube Why can’t you @CanadainIndia?

#openvfscanadaindia

https://t.co/HoIu7krSXd 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:08:05 +0000 2020
Tweet content: 
 From the Kingdom of the Netherlands every year and the Kingdom of Saudi Arabia fine and presented on the occasion o… https://t.co/cTqxCwPP6r

Expectin

 Tuned in to the #HumboldtDay today - not only was it cool to see the data from the @PFTCourses presented by… https://t.co/Yg31hEhJnE

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:09:49 +0000 2020
Tweet content: 
 15 is too high already the fck 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:09:51 +0000 2020
Tweet content: 
 Kevin Ashely Examines how artificial intelligence can be designed to account for different sets of values in "Accounting for Legal Values" 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:09:51 +0000 2020
Tweet content: 
 RT @kaisudesu: „Äê„Éë„ÇπÊâì„Å°„Ç™„Éº„ÉàÊ†ºÂÆâË≤©Â£≤„Äë

ÈÄöÂ∏∏Áâà ÈÄ£ÊâìÁâà „Äê3000ÂÜÜ„Äë

ÊîπËâØÁâà ÈÄ£Êâì‰∏çË¶Å „Äê5000ÂÜÜ„Äë

Êõ¥Êñ∞ÈÄüÂ∫¶ ÁàÜ‰∏ä„Åí „Äê7000ÂÜÜ„Äë
‚ÜëÈÄ£Êâì‰∏çË¶Å

„ÉªÂ∞éÂÖ•„Å´„ÅØÂà•„ÅßPython„Ç¢„Éó„É™„Äê1220ÂÜÜ„Äë„ÅåÂøÖË¶Å„Å´„Å™„Çä„Åæ„Åô

‰ΩøÁî®ÂèØËÉΩ„Å´„Å™„Çã„Åæ„Åß„Çµ„Éù„Éº„Éà„Åó„Åæ„Åô‚Ä¶ 

Expecting value: line 2 c

Imported tweet created at: Thu Sep 17 13:11:42 +0000 2020
Tweet content: 
 RT @StadGroen: Zaterdag 19/9 kunnen geïnteresseerden kosteloos een van 10.000 wintergroene struiken ophalen in Amsterdam. De weggeefactie i…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:11:42 +0000 2020
Tweet content: 
 RT @RiyushaA95: NCB got leads that Prime accused Rhea used to order drugs from outside India, Especially from Netherlands...

 #ArrestSSRKi…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:11:43 +0000 2020
Tweet content: 
 RT @Qiita: 25LGTM！ | 追跡アルゴリズム (C++/python)情報へのリンク https://t.co/qC39HP8KWD

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:11:44 +0000 2020
Tweet content: 
 RT @Draculasswife: The Grebbeberg, Gelderland, The Netherlands. https://t.co/Q3MpWs7ADm 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:11

Imported tweet created at: Thu Sep 17 13:13:23 +0000 2020
Tweet content: 
 RT @archangeljf12: Artificial intelligence program used against ISIS now used by DNC against Trump &amp; supporters online https://t.co/qZZsSTe…

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:13:23 +0000 2020
Tweet content: 
 RT @rinushoogstad: En nou niet de lichtgetinte de schuld geven , want @jesseklaver zei : wat zei hij nou eigenlijk? https://t.co/HZMQM7tE7o 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:13:24 +0000 2020
Tweet content: 
 RT @ianmSC: Nashville’s case numbers in bars and restaurants were so low they literally were emailing each other about how best to hide tha… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:13:25 +0000 2020
Tweet content: 
 RT @AINewsFeed: COVID-19 Recovery Analysis: Artificial Intelligence Platforms Market | Rise In Demand For AI-based... https://t.co/PruFu

Imported tweet created at: Thu Sep 17 13:15:19 +0000 2020
Tweet content: 
 RT @Hidderkaran: Thread:
- Artificial intelligence mana control vundali kani.
- Manam dani control lo vundakudadu.
- I'm not a slave to A.I… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:15:20 +0000 2020
Tweet content: 
 RT @glenannefilm: A huge rise in DVD requests internationally after last night’s screening. This batch posting to Australia, US, Netherland… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:15:21 +0000 2020
Tweet content: 
 RT @pythonbot_: scikit-learn Cookbook - Second Edition: Over 80 recipes for machine learning in Python with scikit-learn https://t.co/u5yKf… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:15:21 +0000 2020
Tweet content: 
 Nearly all the castles depicted in Monty Python and the Holy Grail are actually Doune Castle from different angles https://t.co/xv4ZjZfqk

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:16:51 +0000 2020
Tweet content: 
 It was explained in Godzilla 1998, people. SOURCES!!!! https://t.co/BV4E22jvlA 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:16:52 +0000 2020
Tweet content: 
 RT @danbeltran: 6 Ways Artificial Intelligence and Machine Learning Can Improve Your Marketing https://t.co/m7vTituiOh #machinelearning 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:16:52 +0000 2020
Tweet content: 
 RT @SOliver2020: Zaandam, Netherlands 🇳🇱 https://t.co/x2zQVUL56g 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:16:53 +0000 2020
Tweet content: 
 RT @Zerynth: Do you want to learn more about communication protocols and #MQTT in the world of #IoT? Watch our webinar today:
https://t.co/… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:16:54 +0000 

Imported tweet created at: Thu Sep 17 13:18:27 +0000 2020
Tweet content: 
 RT @WOFES3WAGYIMI: BROTHERS BUT DIFFERENT NATIONAL TEAMS

Paul Pogba (France 🇨🇵)
Florentin Pogba (Guinea 🇬🇳)
Mathias Pogba (Guinea 🇬🇳)

Chr… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:18:29 +0000 2020
Tweet content: 
 RT @YukiSatoLaw: Five Ways That Artificial Intelligence and Machine Learning Are Transforming Legal Practice
Cenza Technologies
https://t.c… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:18:29 +0000 2020
Tweet content: 
 RT @artechnco: Isusu https://t.co/iRT22q1qYw 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:18:30 +0000 2020
Tweet content: 
 RT @ProgrammingHero: [Guidelines] Programming for Beginners!
.
.
.
.
.
https://t.co/uBvXSlkOmU

#BigData #Analytics #DataScience #DeepLearn… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu S

Tweet content: 
 RT @LeaveEUOfficial: WATCH | @GeertWildersPVV: "The Patriotic Spring is happening all over the Western world from America to Britain." Neth… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:20:11 +0000 2020
Tweet content: 
 RT @Jibun_no_Atama: ÂåªÂ≠¶ÁöÑË¶≥ÁÇπ„Åã„Çâ„ÅØ„Éû„Çπ„ÇØ„ÅÆÊúâÂäπÊÄß„ÅØË®ºÊòé„Åï„Çå„Å¶„ÅÑ„Å™„ÅÑ„Åü„ÇÅ„ÄÅÂÜÖÈñ£„ÅØÈùûÂåªÁôÇÁî®„Éû„Çπ„ÇØ„ÇíÁùÄÁî®„Åô„ÇãÁæ©Âãô„ÅØ„Å™„ÅÑ„Åì„Å®„ÇíÊ±∫ÂÆö„Åó„Åæ„Åó„Åü
„Ç™„É©„É≥„ÉÄÂåªÁôÇÂ§ßËá£

„ÅØ„ÅÑ„ÄÅ
„Åì„ÅÆËå∂Áï™„Åã„Çâ„Ç™„É©„É≥„ÉÄ„ÅÑ„Å°Êäú„Åë

Êó•Êú¨„ÅØ‰ΩïÁï™ÁõÆ„Åã„Å™Ôºü

ÈöôÈñì„Å†„Çâ„Åë„Éû„Çπ„ÇØÁùÄÁî®„Åß„Ç¶„Ç§„É´„ÇπÊã°Â§ß„ÅåÈò≤„Åí„Çã‚Ä¶ 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:20:11 +0000 2020
Tweet content: 
 Looking Python Developer For Live  Data Scraping -- 2 - https://t.co/pszXOEWj36 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:20:11 +0000 2020
Tweet content: 
 RT @ianmSC: Nashville’s case numbers

Imported tweet created at: Thu Sep 17 13:22:21 +0000 2020
Tweet content: 
 RT @thiesbeckers: The two largest coalition partners in the Netherlands want to investigate how we can deploy new #nuclear in the Netherlan… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:22:21 +0000 2020
Tweet content: 
 Día 5. Galería de fotos con #Python 

#100DaysOfCode #pythonprogramming https://t.co/5NiUfMc4KA 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:22:21 +0000 2020
Tweet content: 
 RT @lifebiomedguru: FDA has never conducted a dose escalation study of aluminum in mice.  #IPAK has invited and has received a proposal for… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:22:22 +0000 2020
Tweet content: 
 RT @036moji: day47 45m #100DaysOfCode IP15m
#Python #paiza

‚úÖÂïèÈ°åÈõÜ D„É©„É≥„ÇØÊó©Ëß£„Åç„Çª„ÉÉ„Éà 20/20

„ÅäÁñ≤„Çå„Åï„Åæ„Åß„Åó„Åü
„Åä„ÇÑ„Åô„Åø„Å™„Åï„ÅÑ

#„Éó„É≠„Ç∞„É©„Éü„É≥„Ç∞Â≠¶Áø

Imported tweet created at: Thu Sep 17 13:24:21 +0000 2020
Tweet content: 
 RT @alferdi: #BoneiruTinTwitter any Bonairians living in the Netherlands who feel like taking a FREE day trip on Sunday Sept 27 to the Bonn… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:24:22 +0000 2020
Tweet content: 
 RT @EdisonLanza: Amenazas en línea a periodistas. ¿Se puede hacer algo? En Holanda los fiscales están poniendo foco en investigar estas ame… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:24:22 +0000 2020
Tweet content: 
 RT @MilHistNow: On this day in 1944, the first of 41,000 Allied paratroopers are dropped into the Netherlands as part of the ill-fated Oper… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:24:23 +0000 2020
Tweet content: 
 @bakfietsdc I'm so envious of the transportation infrastructure in the Netherlands. 

Expecting value: line 2 column 1 (char 1)
Imported

 RT @ianmSC: Nashville’s case numbers in bars and restaurants were so low they literally were emailing each other about how best to hide tha… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:26:18 +0000 2020
Tweet content: 
 RT @DavideCamera: R for Beginners 📒
 
https://t.co/jjsu78QB6j 

#Algorithms #BigData #Analytics #DataScience #MachineLearning #AI #IoT #IIo… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:26:18 +0000 2020
Tweet content: 
 @wesbury Never seen a word (science) used in so many ways to rationalize policy action. Follow the data, not those… https://t.co/QdMotSVaOm 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:26:19 +0000 2020
Tweet content: 
 RT @DavideCamera: R for Beginners 📒
 
https://t.co/jjsu78QB6j 

#Algorithms #BigData #Analytics #DataScience #MachineLearning #AI #IoT #IIo… 

Expecting value: line 2 column 1 (char 1)
Imported tweet crea

Imported tweet created at: Thu Sep 17 13:27:48 +0000 2020
Tweet content: 
 RT @PTIofficial: PM @ImranKhanPTI inaugurated another flagship project of KP Government, Pak-Austria Fachhochschule University, Haripur. Un… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:27:48 +0000 2020
Tweet content: 
 RT @c_hugman: WBA's Climate and Energy Benchmarks have made it clear that it is time for business leaders to step up and take action in lin… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:27:49 +0000 2020
Tweet content: 
 RT @TheInsaneApp: RigNet: Neural Rigging for Articulated Characters

Code: https://t.co/22orkZVc2E

#theinsaneapp #ML #AI
#MachineLearning… 

Expecting value: line 2 column 1 (char 1)
Imported tweet created at: Thu Sep 17 13:27:51 +0000 2020
Tweet content: 
 - WORLD MANUFACTURING FORUM 2020 -

11-12 November

Artificial Intelligence for the #Manufacturing Renaissance - Th… https://t.co/d4y6gSra

In [None]:
for tweet in tweets_data:
    print(tweet['lang'])

In [None]:
list(map(lambda tweet: tweet['text'], tweets_data))

The data is clearly very noisy. It contains many HTML artifacts, emojis, links, and extra metadata that we either do not need right now or that obscure the content of the tweet.

In cases like these, a preprocessing step is required before analysis can be performed.

As a first step in this direction, we will structure the tweet data into a pandas DataFrame to simplify data manipulation. We start by creating an empty DataFrame called tweets and add three columns to it: text, lang, and country. The text column contains the tweet itself, the lang column the language in which the tweet was written, and the country column the country from which the tweet was sent.

#### This step can take a minute or two

In [None]:
import pandas as pd

tweets = pd.DataFrame()

tweets['text'] = [tweet['text'] for tweet in tweets_data]
tweets['lang'] = [tweet['lang'] for tweet in tweets_data]
# 'place' is None for many tweets, so fall back to None when no country is available.
tweets['country'] = [tweet['place']['country'] if tweet['place'] is not None else None
                     for tweet in tweets_data]


Next, we will create 2 charts
+ The first one describing the Top 5 languages in which the tweets were written
+ The second one describing the Top 5 countries from which the tweets were sent.

We will create these two charts using MatPlotLib (the library we installed at the beginning of the assignment).

In [None]:
import matplotlib.pyplot as plt

# This is a directive that enables displaying charts in iPython notebooks.
%matplotlib inline


tweets_by_lang = tweets['lang'].value_counts()

fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Languages', fontsize=15)
ax.set_ylabel('Number of tweets' , fontsize=15)
ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')

We can do the same thing for countries:

In [None]:
# Note: the country is often not recorded, so this chart may contain very few entries (perhaps just one, or none at all).

tweets_by_country = tweets['country'].value_counts()

fig, ax = plt.subplots()
ax.tick_params(axis='x', labelsize=15)
ax.tick_params(axis='y', labelsize=10)
ax.set_xlabel('Countries', fontsize=15)
ax.set_ylabel('Number of tweets' , fontsize=15)
ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')
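
Both charts rest on the same counting step that `value_counts()` performs. As a rough, library-free sketch of that step (the `langs` list is hypothetical sample data, not output from the stream), the standard-library `Counter` gives the same kind of ranking:

```python
from collections import Counter

# Hypothetical language codes, standing in for the values of tweets['lang'].
langs = ['en', 'nl', 'en', 'ja', 'en', 'nl', 'es', None]

# Count each value, skipping missing entries (value_counts() also drops them by default).
counts = Counter(lang for lang in langs if lang is not None)

# most_common(n) returns the n most frequent values, highest count first.
top_two = counts.most_common(2)
print(top_two)  # [('en', 3), ('nl', 2)]
```

Slicing the pandas result with `[:5]`, as the cells above do, plays the same role as `most_common(5)` here.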

### Food for thought

There are plenty of ways in which gathered data can be skewed and manipulated to paint a false picture of something. This technique is used very often in marketing. Can you think of ways to show that the statistics we displayed in the previous section are biased?

#### Answer

Bias can arise because we filtered for English-language tweets, as well as for Netherlands-specific terms.

### Extracting links from tweets

Tweets very often carry additional context for the statement they are making in a hyperlink. Extracting these hyperlinks from the tweets might provide an expansion of the dataset you are collecting or analyzing. Extracting this type of information is a useful skill in data science. We will do this by using regular expressions. Python provides a library for regular expressions called re. 

We will start by importing this library and creating two functions: one that extracts the hyperlink from a tweet's content, and one that checks whether a specific keyword is present in a text.

In [None]:
import re

# A function that extracts the hyperlinks from the tweet's content.
def extract_link(text):
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    match = re.search(regex, text)
    if match:
        return match.group()
    return ''

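# A function that checks whether a word is included in the tweet's content.
# re.escape() makes sure the word is matched literally, even if it contains
# characters that have a special meaning in regular expressions.
def word_in_text(word, text):
    match = re.search(re.escape(word.lower()), text.lower())
    return match is not None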
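
Note that `extract_link` returns only the first hyperlink it finds. If a tweet may contain several links, one possible variant (a sketch only; `extract_all_links` and `URL_PATTERN` are hypothetical names, and the sample text is made up) uses `findall` with the same pattern:

```python
import re

# Same pattern as in extract_link, compiled once for reuse.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')

def extract_all_links(text):
    """Return every hyperlink found in text (an empty list if there is none)."""
    return URL_PATTERN.findall(text)

sample = 'Docs at https://example.com/a and www.example.org/b are useful'
links = extract_all_links(sample)
print(links)  # ['https://example.com/a', 'www.example.org/b']
```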

Next we add a link column to our predefined DataFrame:

In [None]:
tweets['link'] = tweets['text'].apply(extract_link)

With the help of this frame we can then print all the links from the tweets that we gathered and show them on the screen below:

In [None]:
print(tweets['link'])

**Removing Hyperlinks**

An additional use case that follows from detecting hyperlinks in tweets is removing them. Before subjecting a tweet's content to tokenization or mapping it to some sort of embedding, it is recommended that artifacts like hyperlinks be removed first.

#### You have to define your own 'index_in_dataframe_containing_link'. Pick one of the non-null entries above from "print(tweets['link'])"

In [None]:
index_in_dataframe_containing_link = 0
unescaped_tweet = tweets_data[index_in_dataframe_containing_link]['text']
# With link in the content
print("With link:\n", unescaped_tweet)

# With the link removed
result = re.sub(r"http\S+", "", unescaped_tweet)
print("\n\nLink free:\n", result)
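
Links are not the only artifacts: tweet text can also carry HTML entities such as `&amp;`. One way to decode them before removing links, assuming the standard-library `html.unescape` function is acceptable for this purpose, is sketched below (the tweet text is a made-up example):

```python
import html
import re

# Hypothetical tweet text containing an HTML entity and a link.
raw_tweet = 'AI &amp; data science news https://t.co/abc123'

# Decode HTML entities such as &amp; into the characters they stand for.
unescaped = html.unescape(raw_tweet)

# Remove links with the same pattern as above, then trim leftover whitespace.
link_free = re.sub(r'http\S+', '', unescaped).strip()
print(link_free)  # AI & data science news
```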

### Tokenization

**Definition**: Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

dirty_tweet_tokens = tokenizer.tokenize(unescaped_tweet.lower())

cleaned_tweet_tokens = tokenizer.tokenize(result.lower())

print("Clean tokens:\n", cleaned_tweet_tokens)

print("\n\nDirty tokens:\n", dirty_tweet_tokens)

In [None]:
print("Actually got to this point and understood everything(!!!)")

Notice that the tokens we get from the tweet that still contains the URL include data that is not relevant to natural language and would therefore distort any further analysis based on them.
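
To make the effect concrete without depending on a collected tweet, here is a small sketch that uses only the standard library. It assumes that a `\w+` pattern applied with `re.findall` behaves like the tokenizer above for this purpose; `simple_tokenize` is a hypothetical helper and the tweet is made up:

```python
import re

def simple_tokenize(text):
    """Lowercase text and return its runs of word characters."""
    return re.findall(r'\w+', text.lower())

# Hypothetical tweet, used only for illustration.
with_link = 'Learning #Python is fun https://t.co/abc123'
without_link = re.sub(r'http\S+', '', with_link)

dirty_tokens = simple_tokenize(with_link)
clean_tokens = simple_tokenize(without_link)
print(dirty_tokens)  # ['learning', 'python', 'is', 'fun', 'https', 't', 'co', 'abc123']
print(clean_tokens)  # ['learning', 'python', 'is', 'fun']
```

The URL alone contributes four tokens ('https', 't', 'co', 'abc123') that say nothing about the language of the tweet.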

## Sentiment Analysis

**Sentiment Analysis** is the process of determining whether a piece of writing is positive, negative or neutral. It’s also known as opinion mining: deriving the opinion or attitude of a speaker. A common use case for this technology is to discover how people feel about a particular topic.

There are two general directions in which you can steer your sentiment analysis pipeline:
+ Lexicon Based Approaches
+ Machine Learning Approaches

#### A Basic Machine Learning Approach

NLTK comes with all the pieces you need to get started on sentiment analysis: a movie reviews corpus with reviews categorized into pos and neg categories, and a number of trainable classifiers. We’ll start with a simple NaiveBayesClassifier as a baseline, using boolean word feature extraction.

**What is a Classifier?**

Wikipedia says: "<span style='color:red'>An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category.</span>"

It is basically a computer program that learns how to map a certain input to a certain output. It translates data from the input space into a segmented output space, in which each datum is assigned to a category. For our example we are going to use the Naive Bayes classifier. This classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. This makes the classifier simple and easy to use; however, this assumption of independent predictors is also its main limitation. In real life, it is almost impossible to get a set of predictors that are completely independent.
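
The independence assumption can be written compactly. Using notation introduced here only for illustration (a label $c$ and features $f_1, \dots, f_n$), Naive Bayes scores each label as:

$$P(c \mid f_1, \dots, f_n) \;\propto\; P(c) \prod_{i=1}^{n} P(f_i \mid c)$$

The classifier then picks the label with the highest score. The product over individual features is exactly where the independence assumption enters: each $P(f_i \mid c)$ is estimated on its own, ignoring how features occur together.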

So before we start doing anything we need a ready corpus of data we can use to demonstrate the concept of Sentiment Analysis.

In [6]:
# This snippet downloads the most popular datasets for experimenting with NLTK functionalities.
import nltk
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /home/mark/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /home/mark/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /home/mark/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /home/mark/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /home/mark/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /home/mark/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk

True

As a first step we import the required NLTK modules and define a simple function that is going to extract our features:


In [7]:
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews


# Turn a list of words into a feature dictionary that marks every word as present.
def word_feats(words):
    return {word: True for word in words}

# Get the file ids of the negative movie reviews
negids = movie_reviews.fileids('neg')

# Get the file ids of the positive movie reviews
posids = movie_reviews.fileids('pos')

# Build one (features, label) pair per negative review
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]

# Build one (features, label) pair per positive review
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# We use only three quarters of the reviews (1500 instances) for training. The remaining quarter is kept for testing.
negcutoff = int(len(negfeats)*3/4)
poscutoff = int(len(posfeats)*3/4)

In [8]:
# Construct the training dataset containing 50% positive reviews and 50% negative reviews
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]

# Construct the test dataset containing 50% positive reviews and 50% negative reviews
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

print ('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

# Train a NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(trainfeats)

# Test the trained classifier and display the most informative features.
print ('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()

train on 1500 instances, test on 500 instances
accuracy: 0.728
Most Informative Features
             magnificent = True              pos : neg    =     15.0 : 1.0
             outstanding = True              pos : neg    =     13.6 : 1.0
               insulting = True              neg : pos    =     13.0 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
                  avoids = True              pos : neg    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
              astounding = True              pos : neg    =     10.3 : 1.0
             fascination = True              pos : neg    =     10.3 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0


### Tweets Example

The previous example was fairly large scale: we had a dataset of 2000 film reviews. A smaller tweet dataset better serves our purpose here. See the example below:

In [9]:
# For this example we define our own dataset of 5 positive and 5 negative tweets.

# Positive tweets and their sentiment label
pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]

# Negative tweets and their sentiment label
neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),
              ('He is my enemy', 'negative')]

# The list of tweets we are going to use for testing (groundtruth)
test_tweets = [(['feel', 'happy', 'this', 'morning'], 'positive'),
    (['larry', 'friend'], 'positive'),
    (['not', 'like', 'that', 'man'], 'negative'),
    (['house', 'not', 'great'], 'negative'),
    (['your', 'song', 'annoying'], 'negative')]


We take both of those lists and create a single list of tuples, each containing two elements. The first element is a list of the words and the second element is the sentiment. We discard words shorter than 3 characters and convert everything to lowercase.

In [10]:
# pprint is a module for pretty printing
from pprint import pprint


tweets = []

# In this for loop we create a list of tuples like: (list_of_words_with_at_least_3_letters, sentiment_label)
for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, sentiment))

# Print what our dataset looks like after our own 'custom' tokenization of the tweets.
print("### Training examples ###\n")
pprint(tweets)

print("\n\n### Testing examples ###\n")
pprint(test_tweets)


### Training examples ###

[(['love', 'this', 'car'], 'positive'),
 (['this', 'view', 'amazing'], 'positive'),
 (['feel', 'great', 'this', 'morning'], 'positive'),
 (['excited', 'about', 'the', 'concert'], 'positive'),
 (['best', 'friend'], 'positive'),
 (['not', 'like', 'this', 'car'], 'negative'),
 (['this', 'view', 'horrible'], 'negative'),
 (['feel', 'tired', 'this', 'morning'], 'negative'),
 (['not', 'looking', 'forward', 'the', 'concert'], 'negative'),
 (['enemy'], 'negative')]


### Testing examples ###

[(['feel', 'happy', 'this', 'morning'], 'positive'),
 (['larry', 'friend'], 'positive'),
 (['not', 'like', 'that', 'man'], 'negative'),
 (['house', 'not', 'great'], 'negative'),
 (['your', 'song', 'annoying'], 'negative')]


Exactly like in the example above, we define two functions: one that collects all the words in our tweet corpus, and a second one that derives the word features on which we will train a classifier. The second function also prints how often each word appears; the features themselves are the distinct words.

In [11]:
# Get the separate words in tweets
# Input:  A list of tweets
# Output: A list of all words in the tweets
def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

# Collect the distinct words and print how often each one occurs
# Input: the list of words
# Output: the distinct words that appear in the tweets
def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    print("Word frequency list\n")
    pprint(wordlist)
    return word_features




To create a classifier, we need to decide what features are relevant. To do that, we first need a feature extractor. The one we are going to use returns a dictionary indicating what words are contained in the input passed. Here, the input is the tweet. We use the word features list defined above along with the input to create the dictionary.

With our feature extractor, we can apply the features to our classifier using the method apply_features. We pass the feature extractor along with the tweets list defined above.

In [12]:
word_features = get_word_features(get_words_in_tweets(tweets))

# Construct our features based on which tweets contain which word
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features


Word frequency list

{'about': 1,
 'amazing': 1,
 'best': 1,
 'car': 2,
 'concert': 2,
 'enemy': 1,
 'excited': 1,
 'feel': 2,
 'forward': 1,
 'friend': 1,
 'great': 1,
 'horrible': 1,
 'like': 1,
 'looking': 1,
 'love': 1,
 'morning': 2,
 'not': 2,
 'the': 2,
 'this': 6,
 'tired': 1,
 'view': 2}
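
To see what a feature extractor like `extract_features` returns, here is a self-contained sketch with a hypothetical three-word vocabulary (`vocabulary` and `contains_features` are illustrative names; in the notebook the vocabulary role is played by `word_features`):

```python
# Hypothetical vocabulary; in the notebook this role is played by word_features.
vocabulary = ['love', 'car', 'horrible']

def contains_features(document_words, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    present = set(document_words)
    return {'contains(%s)' % word: (word in present) for word in vocabulary}

features = contains_features(['love', 'this', 'car'], vocabulary)
print(features)
# {'contains(love)': True, 'contains(car)': True, 'contains(horrible)': False}
```

Words outside the vocabulary, such as 'this' in this toy example, simply do not show up in the feature dictionary.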


As you can see, **‘this’** is the most used word in our tweets (6 occurrences). Seven other words, including **‘car’** and **‘concert’**, appear twice each, and the rest appear once.

The variable ‘training_set’ contains the labeled feature sets. It is a list of tuples, each containing the feature dictionary and the sentiment string for one tweet. The sentiment string is also called the ‘label’.

In [13]:
# Here we apply the features we constructed to our tweets data.
training_set = nltk.classify.apply_features(extract_features, tweets)

# Printing the resulting training set shows the features we are going to pass to the classifier.
pprint(training_set)

[({'contains(love)': True, 'contains(this)': True, 'contains(car)': True, 'contains(view)': False, 'contains(amazing)': False, 'contains(feel)': False, 'contains(great)': False, 'contains(morning)': False, 'contains(excited)': False, 'contains(about)': False, 'contains(the)': False, 'contains(concert)': False, 'contains(best)': False, 'contains(friend)': False, 'contains(not)': False, 'contains(like)': False, 'contains(horrible)': False, 'contains(tired)': False, 'contains(looking)': False, 'contains(forward)': False, 'contains(enemy)': False}, 'positive'), ({'contains(love)': False, 'contains(this)': True, 'contains(car)': False, 'contains(view)': True, 'contains(amazing)': True, 'contains(feel)': False, 'contains(great)': False, 'contains(morning)': False, 'contains(excited)': False, 'contains(about)': False, 'contains(the)': False, 'contains(concert)': False, 'contains(best)': False, 'contains(friend)': False, 'contains(not)': False, 'contains(like)': False, 'contains(horrible)': Fa

Now that we have our training set, we can train our classifier like in the previous example.

In [14]:
# This is the line of code that we use to train our classifier. Training is performed in a streamlined way so no output is visible.
classifier = nltk.NaiveBayesClassifier.train(training_set)

The Naive Bayes classifier combines the prior probability of each label, which is the frequency of that label in the training set, with the contribution of each feature. In our case, the labels ‘positive’ and ‘negative’ are equally frequent. The word ‘amazing’ appears in 1 of the 5 positive tweets and in none of the negative tweets. This means that the likelihood of the ‘positive’ label is multiplied by roughly 0.2 when this word is part of the input (the exact factor depends on the smoothing the classifier applies).

So in our dataset the probability of each label is 0.5 as we can see below.

### <span style='color:red'>**Interesting observation**</span>

If we observe the output of the cell below, an interesting observation jumps out. Line one of the output has this content:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;contains(not) = False          positi : negati =      1.6 : 1.0

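This line tells us that if a tweet does not contain the word not **(contains(not) = False)**, then it is about 60% more likely to be positive than negative, as shown by **(positi : negati = 1.6 : 1.0)**. Strictly speaking, the ratio compares how likely this feature value is under each label; because both labels are equally frequent here, the two readings coincide.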
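
The 1.6 : 1.0 ratio can be reproduced by hand. The sketch below assumes that NLTK's Naive Bayes classifier uses its default estimator, which to our understanding adds 0.5 to every count (expected likelihood estimation) over the two possible feature values True and False; `smoothed_prob` is a hypothetical helper written for this illustration:

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate: (count + gamma) / (total + gamma * bins)."""
    return (count + gamma) / (total + gamma * bins)

tweets_per_label = 5       # 5 positive and 5 negative training tweets
without_not_positive = 5   # no positive tweet contains 'not'
without_not_negative = 3   # 2 of the 5 negative tweets contain 'not'

p_false_given_positive = smoothed_prob(without_not_positive, tweets_per_label)
p_false_given_negative = smoothed_prob(without_not_negative, tweets_per_label)

ratio = p_false_given_positive / p_false_given_negative
print(round(ratio, 1))  # 1.6
```

The same arithmetic gives 1.2 for a word that occurs in exactly one tweet of a single label, which matches the remaining lines of the output.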

In [15]:
# show_most_informative_features() prints its table itself and returns None, so no print() is needed.
classifier.show_most_informative_features(32)

Most Informative Features
           contains(not) = False          positi : negati =      1.6 : 1.0
         contains(about) = False          negati : positi =      1.2 : 1.0
       contains(amazing) = False          negati : positi =      1.2 : 1.0
          contains(best) = False          negati : positi =      1.2 : 1.0
         contains(enemy) = False          positi : negati =      1.2 : 1.0
       contains(excited) = False          negati : positi =      1.2 : 1.0
       contains(forward) = False          positi : negati =      1.2 : 1.0
        contains(friend) = False          negati : positi =      1.2 : 1.0
         contains(great) = False          negati : positi =      1.2 : 1.0
      contains(horrible) = False          positi : negati =      1.2 : 1.0
          contains(like) = False          positi : negati =      1.2 : 1.0
       contains(looking) = False          positi : negati =      1.2 : 1.0
          contains(love) = False          negati : positi =      1.2 : 1.0

Now that we have seen the feature distribution for our data, we can see how our classifier behaves in a real scenario, where we apply it to a tweet it has not seen before and that is part of neither the training set nor the test set.

If our tweet is:
**Larry is my friend**

We would expect that the attributed sentiment would be: **positive**

In [16]:
# The tweet we are about to classify
tweet = 'Larry is my friend'
print (classifier.classify(extract_features(tweet.split())))


positive


On the other hand if our tweet is: **This dish is horrible**

We would expect that the attributed sentiment would be: **negative**

In [17]:
# The tweet we are about to classify
tweet = 'This dish is horrible'
print (classifier.classify(extract_features(tweet.split())))

negative


However, our simple classifier, trained on just 10 tweets, is easy to fool. For example, it has not encountered the word **horrendous**. So if our tweet is:

**Ivo listens to horrendous electronic music.**

We would not know what to expect.

In [18]:
tweet = 'Ivo listens to horrendous electronic music'
print (classifier.classify(extract_features(tweet.split())))

positive


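In this case, the simple classifier made a mistake. None of the words in this tweet occur in the training data, so every feature is False; the classifier then leans towards positive, mainly because the absence of **not** is more typical of the positive training tweets.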

## Topic Modeling

One technique for text mining in Data Science is Topic Modelling. As the name suggests, it is a process to automatically identify topics present in a text object and to derive hidden patterns exhibited by a text corpus, thereby assisting better decision making.

Topic Modelling is different from rule-based text mining approaches that use regular expressions or dictionary-based keyword searching techniques. It is an unsupervised approach for finding and observing groups of words (called “topics”) in large clusters of texts.

Topics can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should produce “health”, “doctor”, “patient”, “hospital” for the topic “Healthcare”, and “farm”, “crops”, “wheat” for the topic “Farming”.

Topic models are very useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection. For example, the New York Times uses topic models to boost its user-article recommendation engines. Professionals in the recruitment industry use topic models to extract latent features of job descriptions and map them to the right candidates. Topic models are also used to organize large datasets of emails, customer reviews, and user social media profiles.

### What is Latent Dirichlet Allocation (LDA)?

There are many approaches for obtaining topics from a text, such as Term Frequency and Inverse Document Frequency (TF-IDF) and Non-negative Matrix Factorization techniques. Latent Dirichlet Allocation is the most popular topic modeling technique, and it is the one we discuss in this notebook.

LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.
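The generative story above can be sketched in a few lines of plain Python. This is only an illustration of the assumption LDA makes, not of how LDA is trained: the two toy topics, their word probabilities and the helper names are made up for this example.

```python
import random

# Hypothetical toy topics: each maps words to probabilities (illustrative values only).
topics = {
    "fitness": {"body": 0.4, "fitness": 0.4, "good": 0.2},
    "cars": {"car": 0.5, "engine": 0.3, "fast": 0.2},
}

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, rng):
    """Pick a topic mixture for the document, then a topic and a word per position."""
    names = list(topics)
    mixture = sample_dirichlet([alpha] * len(names), rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mixture)[0]      # choose a topic
        vocab = topics[topic]
        word = rng.choices(list(vocab), weights=list(vocab.values()))[0]  # choose a word
        words.append(word)
    return mixture, words

rng = random.Random(0)
mixture, words = generate_document(8, alpha=0.5, rng=rng)
print("topic mixture:", mixture)
print("generated words:", words)
```

Training an LDA model runs this story in reverse: it only sees the words and tries to recover plausible topics and mixtures.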

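LDA can be viewed as a matrix factorization technique. In vector space, any corpus (collection of documents) can be represented as a document-term matrix, with one row per document, one column per term, and each cell holding the number of times the term occurs in the document.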
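To make the document-term matrix concrete, here is a minimal sketch in plain Python. The three tiny tokenized documents are made up for illustration; a real pipeline would use a library such as gensim rather than nested lists.

```python
# Three hypothetical, already tokenized documents.
docs = [
    ["car", "engine", "car"],
    ["fitness", "body"],
    ["car", "body"],
]

# The vocabulary defines the columns of the matrix.
vocab = sorted({term for doc in docs for term in doc})

# One row per document, one column per term, each cell a term count.
matrix = [[doc.count(term) for term in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Most cells are zero even in this tiny example, which is why libraries usually store such matrices in a sparse form.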



### LDA Parameters

Alpha and Beta hyperparameters: alpha represents document-topic density and beta represents topic-word density. The higher the value of alpha, the more topics each document is composed of; the lower the value, the fewer topics each document contains. Likewise, with a higher beta each topic is composed of a large number of the words in the corpus, and with a lower beta each topic is composed of few words.

Number of Topics: the number of topics to be extracted from the corpus. Researchers have developed approaches to estimate an optimal number of topics using the Kullback-Leibler (KL) divergence score. This is not discussed in detail here, as it is fairly mathematical; for background, see the original paper [1] on the use of KL divergence.

Number of Topic Terms: the number of terms used to describe a single topic. It is generally chosen according to the requirement: if the goal is to extract themes or concepts, a higher number is recommended; if the goal is to extract features or terms, a lower number is recommended.

Number of Iterations / passes: the maximum number of iterations the LDA algorithm is allowed in order to converge.
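The effect of alpha described above can be checked with a small experiment. The sketch below draws document-topic mixtures from a symmetric Dirichlet distribution for a low and a high alpha and reports how much weight the dominant topic gets on average. The helper names, the choice of 5 topics and the two alpha values are assumptions made for this illustration. In gensim, the two hyperparameters correspond to the `alpha` and `eta` arguments of `LdaModel`.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw a k-topic mixture from a symmetric Dirichlet(alpha) distribution."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=5, n=2000, seed=0):
    """Average weight of the dominant topic over n sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

low = mean_largest_share(0.1)    # low alpha: documents dominated by few topics
high = mean_largest_share(10.0)  # high alpha: documents spread over many topics

print("alpha=0.1  -> mean share of dominant topic:", round(low, 3))
print("alpha=10.0 -> mean share of dominant topic:", round(high, 3))
```

With a low alpha the dominant topic takes most of the weight, so each document is mostly about one topic; with a high alpha the weight is spread more evenly across topics.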


### Sample Topic Modeling Assignment

As step one of this assignment we construct our own dataset of 5 documents on different topics, as shown below:


In [1]:
doc1 = "Working out is great for the body. Fitness makes you feel good."
doc2 = "Red cars are faster than blue cars."
doc3 = "Doctors suggest that fitness increases muscle mass and speeds up metabolism."
doc4 = "Cars with electrical engines cause less pollution than cars with internal combustion engines."
doc5 = "Pushups make a good upper body exercise."

# compile documents
doc_complete = [doc1, doc2, doc3, doc4, doc5]

We already noted that cleaning is an important step before any text mining task. In this step we remove punctuation and stopwords and normalize the corpus, which makes it suitable for further analysis.

We just introduced the term <i>stopwords</i>. Stopwords are words that are filtered out before any analysis of natural language in order to increase efficiency and remove clutter. Examples of stopwords are:
+ and
+ there
+ want
+ thus
+ if 
+ etc...
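As a minimal sketch of stopword filtering, the snippet below uses only the example words listed above as its stopword set. The sentence is made up for illustration, and a real analysis would use a fuller list such as the one shipped with NLTK.

```python
# Tiny illustrative stopword set, taken from the examples listed above.
stop_words = {"and", "there", "want", "thus", "if"}

sentence = "if there is rain and wind we want shelter"

# Lowercase, split on whitespace, and drop every token that is a stopword.
tokens = sentence.lower().split()
filtered = [token for token in tokens if token not in stop_words]

print(filtered)  # ['is', 'rain', 'wind', 'we', 'shelter']
```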

Another new thing we will encounter in the code below is a lemmatizer. A lemmatizer performs lemmatization, which reduces a word to its base form by removing inflectional endings. A more formal definition is provided below:

<span style='color:red'>Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. </span>

**Note**: If unclear about the implementation of the methods please consult the NLTK documentation.

In [2]:
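# On first use, the NLTK data must be downloaded once:
# import nltk; nltk.download('stopwords'); nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string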

# Create a set of stopwords
stop = set(stopwords.words('english'))

# Create a set of punctuation characters
exclude = set(string.punctuation) 

# This is the object that performs the lemmatization
lemma = WordNetLemmatizer()

# In this function we perform the entire cleaning
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

# This is the clean corpus.
doc_clean = [clean(doc).split() for doc in doc_complete] 

### Preparing Document-Term Matrix

All the text documents combined are known as the corpus. To run a mathematical model on a text corpus, it is good practice to convert it into a matrix representation. The LDA model looks for repeating term patterns in the entire document-term (DT) matrix. Python provides many libraries for text mining; "gensim" is one such library for handling text data, designed to be scalable, robust and efficient.

The following code shows how to convert a corpus into a document-term matrix.

In [3]:
# Importing Gensim
import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

In [4]:
print(dictionary)

Dictionary(28 unique tokens: ['body', 'feel', 'fitness', 'good', 'great']...)


In [5]:
print(doc_term_matrix)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(7, 1), (8, 2), (9, 1), (10, 1)], [(2, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)], [(8, 2), (18, 1), (19, 1), (20, 1), (21, 2), (22, 1), (23, 1), (24, 1)], [(0, 1), (3, 1), (5, 1), (25, 1), (26, 1), (27, 1)]]
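Each inner list in the output above is one document in sparse form: a list of (token id, count) pairs, with terms that do not occur left out. The sketch below expands the second document's row into a dense row of the document-term matrix. The helper name `bow_to_dense` is hypothetical, and the vocabulary size of 28 is taken from the dictionary printed earlier.

```python
def bow_to_dense(bow, num_terms):
    """Expand sparse (token_id, count) pairs into a dense row of counts."""
    row = [0] * num_terms
    for token_id, count in bow:
        row[token_id] = count
    return row

# Second document, copied from the printed doc_term_matrix above.
doc2_bow = [(7, 1), (8, 2), (9, 1), (10, 1)]

dense_row = bow_to_dense(doc2_bow, num_terms=28)
print(dense_row)
```

The pair `(8, 2)` means that the term with id 8 occurs twice in that document, which matches the two occurrences of "cars" in `doc2`.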


### Run the LDA model

The next step is to create an LDA model object and train it on the document-term matrix. Training requires a few parameters as input, which are explained in the section above. The gensim module supports both LDA model estimation from a training corpus and inference of topic distributions on new, unseen documents.


In the code below we run the LDA model with two topics, using the words defined in our dictionary, for 100 passes over the corpus. Feel free to change the number of passes and see how the outcome changes.

In [6]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=2, id2word = dictionary, passes=100)

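In the output below, each line is a topic with its individual topic terms and weights. Topic 0 can be termed Vehicles and Engines, and Topic 1 can be termed Benefits of Fitness. Because LDA training is randomized, the exact weights and the order of the topics may differ between runs.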

In [7]:
# Print the 2 topics and describe them with 4 words each.
topics = ldamodel.print_topics(num_topics=2, num_words=4)

for i, topic in enumerate(topics):
    print("Topic", i, "->", topic)


Topic 0 -> (0, '0.122*"car" + 0.068*"engine" + 0.041*"combustion" + 0.041*"cause"')
Topic 1 -> (1, '0.092*"good" + 0.092*"body" + 0.092*"make" + 0.056*"fitness"')


We notice that our LDA model performed well in guessing the two topics our documents covered.
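For further analysis or plotting, it can be handy to turn the printed topic strings into (word, weight) pairs. The sketch below parses the Topic 0 string from the output above with a regular expression. The helper name `parse_topic` is hypothetical. As far as we know, gensim can also return such pairs directly (for example via `ldamodel.show_topic`), so treat this as an illustration of the string format rather than the recommended route.

```python
import re

def parse_topic(topic_string):
    """Turn '0.122*"car" + 0.068*"engine"' into [('car', 0.122), ('engine', 0.068)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Topic 0 string, copied from the output above.
topic0 = '0.122*"car" + 0.068*"engine" + 0.041*"combustion" + 0.041*"cause"'

for word, weight in parse_topic(topic0):
    print(word, weight)
```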

## Assignment
### This assignment should result in a report following the report guidelines found on Canvas. The full details of the assignment are also found on Canvas.
### Due date: <b style='color: red'>  25/09/20, 23:59:59 </b>

So far we have covered the following sections:

+ Basic Python development
+ Pandas data management
+ Gathering data (scraping the Twitter Streaming API)
+ Storing and organizing it (store to file or a database)
+ Preprocessing the data
+ Performing sentiment and topical analysis
+ Visualizing insight

Given the newly acquired skills, your assignment is to perform an analysis on a dataset of tweets already provided in the course. The analysis should contain sentiment analysis and try to include topic modelling. Use different splits of the data you have to perform your analysis, compare, correlate and visualize.

The dataset can be found on Canvas in the "2020_Assignments" folder.

All of the tweets carry a geolocation, so it would be natural to show the geographical distribution of your analysis on the map you designed in Week 2 of the course. When visualising the results, keep in mind that you can change the size, color, location or even boundaries of the map. You can also hide and show regions depending on the point you are trying to make.

Two ideas to try to correlate the findings with:

+ Demographics - e.g. average state age
+ Past State vote percentage difference

This information can be obtained from the sources mentioned in the lectures.
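As a possible starting point for the correlation step, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The per-state numbers are made-up placeholders, NOT real data, and the variable names are hypothetical; replace them with the values from your own analysis and sources.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values for four hypothetical states (NOT real data).
avg_sentiment = [0.1, 0.3, 0.2, 0.5]
avg_age = [36.0, 38.5, 37.0, 41.0]

r = pearson(avg_sentiment, avg_age)
print("Pearson r:", round(r, 3))
```

A value close to 1 or -1 indicates a strong linear relationship and a value near 0 a weak one. Keep in mind that correlation alone does not show causation.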

As an additional resource to help with the sentiment analysis, there is a Java-based utility in the same folder named SentiStrength.

[Sentistrength (Free download version)](http://sentistrength.wlv.ac.uk/#Download)