<a href="https://colab.research.google.com/github/marcelolandivar/Python_Projects/blob/master/Twitter_scrapping_%2B_Naives_Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TWITTER SCRAPING AND TEXT CLASSIFICATION WITH NAIVE BAYES
### By: Marcelo Landivar

*  This notebook is a test of scraping tweets from different accounts; it offers several options for extracting more specific tweets.
*  The model assigns each tweet a topic label using a simple Naive Bayes model trained on the 20 Newsgroups dataset.
*  The final result is a table pairing each tweet with its predicted label. The output can also be saved as a CSV file.


---

>**Email:** <MarceloLandivar24@gmail.com>\
> **RESOURCES:**  Sklearn, GetOldTweets


Open this notebook in Google Colaboratory: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1HI3QOn60OF3IQDrDsW3SmjiB-ErHNTT9?usp=sharing)




In [1]:
import numpy as np
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline



In [2]:
# Install the GetOldTweets3 library, which is used below to scrape tweets.
# The library can also export tweets to a specified CSV file ("output_got.csv" by default).
!pip install GetOldTweets3

Collecting GetOldTweets3
  Downloading https://files.pythonhosted.org/packages/ed/f4/a00c2a7c90801abc875325bb5416ce9090ac86d06a00cc887131bd73ba45/GetOldTweets3-0.0.11-py3-none-any.whl
Collecting pyquery>=1.2.10
  Downloading https://files.pythonhosted.org/packages/78/43/95d42e386c61cb639d1a0b94f0c0b9f0b7d6b981ad3c043a836c8b5bc68b/pyquery-1.4.1-py2.py3-none-any.whl
Collecting cssselect>0.7.9
  Downloading https://files.pythonhosted.org/packages/3b/d4/3b5c17f00cce85b9a1e6f91096e1cc8e8ede2e1be8e96b87ce1ed09e92c5/cssselect-1.1.0-py2.py3-none-any.whl
Installing collected packages: cssselect, pyquery, GetOldTweets3
Successfully installed GetOldTweets3-0.0.11 cssselect-1.1.0 pyquery-1.4.1


In [111]:
# Define a helper class that pulls tweets from Twitter
import GetOldTweets3 as got

class tweets:
  """
  Scrape tweets with the GetOldTweets3 library.

  account_name and max_tweets are required. date_from and date_to optionally
  bound the search, and top_tweets is a boolean that selects only top tweets.
  Tweet and user metadata can also be retrieved.
  """
  def __init__(self, account_name, max_tweets, date_from=None, date_to=None, top_tweets=False):
    self.account_name = account_name
    self.max_tweets = max_tweets
    self.date_from = date_from
    self.date_to = date_to
    self.top_tweets = top_tweets
    # Build the criteria step by step so each date bound is applied only when given
    self.tweet_config = (got.manager.TweetCriteria()
                         .setUsername(self.account_name)
                         .setMaxTweets(self.max_tweets)
                         .setTopTweets(self.top_tweets))
    if self.date_from is not None:
      self.tweet_config = self.tweet_config.setSince(self.date_from)
    if self.date_to is not None:
      self.tweet_config = self.tweet_config.setUntil(self.date_to)

  def _fetch(self):
    # Query Twitter once per call instead of once per tweet; iterating over the
    # result also avoids an IndexError when fewer than max_tweets come back
    return got.manager.TweetManager.getTweets(self.tweet_config)

  def print_tweets(self):
    "Return the text of each scraped tweet"
    return [tweet.text for tweet in self._fetch()]

  def tweets_data(self):
    "Return [index, username, favorites, retweets, date] for each scraped tweet"
    return [[i, tweet.username, tweet.favorites, tweet.retweets, tweet.date]
            for i, tweet in enumerate(self._fetch())]

  def tweets_topic(self, topic):
    "Pair each scraped tweet with a label chosen for this account"
    self.topic = topic
    return [[tweet.text, topic] for tweet in self._fetch()]


In [113]:
# Extract tweets for two categories (baseball and Apple computer news), plus two tweets from a third account

jmc = tweets('jamiemclennan29', 25)
tweets_jmc = jmc.print_tweets()
tweets_data_jmc= jmc.tweets_data()

mac = tweets('MacRumors', 25)
tweets_mac = mac.print_tweets()
tweets_data_mac = mac.tweets_data()

xy = tweets('2010MisterChip', 2)
xy_print = xy.print_tweets()

In [67]:
# Visualizing one of the tweets and the data for the tweet
tweets_data_mac[3], tweets_mac[3]

([3,
  'MacRumors',
  38,
  12,
  datetime.datetime(2020, 8, 19, 9, 20, 24, tzinfo=datetime.timezone.utc)],
 'Korean Startups Call for Investigation into Apple and Google In-App Purchases https://www.macrumors.com/2020/08/19/korea-call-for-investigation-into-apple-and-google/ by @hartleycharlton')

In [209]:
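# Randomize the extracted tweets
import random

def randomize_tweets(*tweets):
  # Flatten the per-account lists into a single list
  lista = [t for tweet in tweets for t in tweet]
  # Sampling the whole list without replacement returns a shuffled copy
  tweets_list = random.sample(lista, len(lista))
  return tweets_list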
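
Because the shuffle is random, the order of the tweets (and therefore of the rows in the final table) changes on every run. If a repeatable order is wanted, one option is to shuffle with a seeded generator. The sketch below is only an illustration: `randomize_tweets_seeded` and its `seed` parameter are hypothetical names that do not appear in the notebook.

```python
import random

def randomize_tweets_seeded(*tweet_lists, seed=0):
    """Flatten several tweet lists and shuffle them reproducibly.

    Hypothetical variant of randomize_tweets, for illustration only.
    """
    flat = [t for tweet_list in tweet_lists for t in tweet_list]
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(flat, len(flat))

first = randomize_tweets_seeded(["a1", "a2"], ["b1", "b2", "b3"], seed=42)
second = randomize_tweets_seeded(["a1", "a2"], ["b1", "b2", "b3"], seed=42)
print(first == second)   # the same seed gives the same order
print(sorted(first))     # every input tweet is still present exactly once
```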

In [210]:
# Note: this rebinds the name `tweets` from the class to the shuffled list of tweet texts
tweets= randomize_tweets(tweets_mac, tweets_jmc, xy_print)

In [5]:
# It is possible to use the original 20news dataset or a different, more complete one
#!wget -c http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz
#!tar -zxvf /content/20news-18828.tar.gz

In [212]:
# Prepare the Multinomial Naive Bayes classifier
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Fetch the 20news data, build the (index, name) category list, and load the training subset
data = fetch_20newsgroups()
categories = list(enumerate(data.target_names))
train_dataset = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Create a model for classifying the tweets
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
trained_model = model.fit(train_dataset.data, train_dataset.target)
labels = trained_model.predict(tweets)
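
The pipeline above delegates all the work to scikit-learn. To make the idea behind a multinomial Naive Bayes classifier more concrete, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: it uses raw word counts instead of TF-IDF weights, applies Laplace smoothing with an assumed `alpha=1.0`, and trains on a made-up four-sentence corpus. The names `train_nb`, `predict_nb`, and `model_state` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab

def predict_nb(text, class_docs, word_counts, vocab, alpha=1.0):
    """Return the class with the highest log posterior, using Laplace smoothing."""
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            count = word_counts[label][word]
            score += math.log((count + alpha) / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy corpus, invented for this illustration
toy_docs = [
    "the pitcher threw a fast ball",
    "home run in the ninth inning",
    "new macbook ships with faster chip",
    "apple releases ios beta update",
]
toy_labels = ["sport", "sport", "tech", "tech"]
model_state = train_nb(toy_docs, toy_labels)

print(predict_nb("apple ships new chip", *model_state))  # tech
print(predict_nb("a home run ball", *model_state))       # sport
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.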

In [233]:
# Build the final table of predictions, optionally saving it as a CSV file in the current working folder
def prediction_dataframe(labels, categories, csv=False):

  def predict_category(labels, categories):
    # Look up each label's name in order, so predictions stay aligned with the tweets
    names = dict(categories)
    return [names[i] for i in labels]

  prediction = predict_category(labels, categories)
  # Relies on the global `tweets` list that was passed to the model
  result = pd.DataFrame(list(zip(tweets, prediction)),
               columns =['Tweet', 'Prediction'])
  if csv:
    result.to_csv("Tweets_Topic_Classification.csv", encoding='utf-8')
  return result
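
Each entry of `labels` is an integer index into the category list, and the i-th label belongs to the i-th tweet, so the mapping from labels to names has to preserve that order. The sketch below illustrates the mapping on its own; `labels_to_names`, the three example categories, and the example labels are hypothetical and chosen only for illustration.

```python
def labels_to_names(labels, categories):
    """Map each integer label to its category name, preserving input order.

    `categories` is a list of (index, name) pairs, as produced by
    list(enumerate(target_names)).
    """
    names = dict(categories)
    return [names[label] for label in labels]

# Hypothetical categories and predicted labels, for illustration only
example_categories = list(enumerate(["comp.graphics", "rec.sport.baseball", "sci.space"]))
example_labels = [2, 0, 2, 1]

print(labels_to_names(example_labels, example_categories))
# ['sci.space', 'comp.graphics', 'sci.space', 'rec.sport.baseball']
```

Looping over the categories in the outer loop and over the labels in the inner loop would instead return the names grouped by category index, which no longer lines up with the tweets.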

In [235]:
# Table with the results
result = prediction_dataframe(labels, categories)
result[:3]

|   | Tweet | Prediction |
|---|-------|------------|
| 0 | Apple Seeds Fifth Beta of tvOS 14 to Developer... | comp.graphics |
| 1 | Epic Games Aiming to Recruit ‘Coalition of App... | comp.graphics |
| 2 | Apple Says ‘We Won’t Make an Exception’ for Ep... | comp.graphics |


The model itself is simple and straightforward. It could be improved with a larger, more complete training dataset. The Naive Bayes classifier is an interesting method that could be extended with an n-gram probabilistic model. Going further, one could use a Recurrent Neural Network (RNN) trained on a large corpus built by combining many tweets into a single text.
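
As a rough illustration of the n-gram idea, the sketch below extracts word unigrams and bigrams from a piece of text in plain Python. The function name `word_ngrams` and its parameters are hypothetical. In the scikit-learn pipeline used above, a comparable effect can be obtained through the `ngram_range` parameter of `TfidfVectorizer`, for example `TfidfVectorizer(ngram_range=(1, 2))`; whether this actually improves the predictions on tweets has not been tested here.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    words = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

print(word_ngrams("Apple seeds fifth beta"))
# ['apple', 'seeds', 'fifth', 'beta', 'apple seeds', 'seeds fifth', 'fifth beta']
```

Bigrams let the classifier treat a pair such as "home run" as a single feature instead of two unrelated words.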