* **ID: 726832020**
* **Email: Lawal1998@yahoo.com** 

# Web Scraping Challenge
In this project, I scrape data from several web pages to identify influencers' rank positions on Twitter. Some of the sites I use to obtain the data are:
* [100 most influential Twitter users in Africa](https://africafreak.com/100-most-influential-twitter-users-in-africa)
* [African leaders respond to coronavirus… on Twitter](https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#east-africa)
* [Top 18 African Heads of State on Twitter: it’s a mixed bag](https://enitiate.solutions/top-18-african-heads-of-states-on-twitter/)


In [1]:
# Installing and importing necessary packages (lxml is the parser passed to BeautifulSoup below)
!pip install requests BeautifulSoup4 fire lxml

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup



In [2]:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
import pandas as pd
import os, sys

import fire

## Hundred Most Influential Twitter Users in Africa

In the next series of cells, I will be scraping the Name and Twitter handle of the 100 most influential users in Africa according to the website provided in [100 most influential Twitter users in Africa](https://africafreak.com/100-most-influential-twitter-users-in-africa). This website uses three key metrics to find the top 100 influencers:

* Popularity (Retweet Influence): measured by the number of Retweets and Likes users get 
* Reach (Indegree Influence): measured by the size of their audience 
* Relevance (Mentions Influence): measured by the relevancy of their content
 

In [3]:
#getting the html page

url = "https://africafreak.com/100-most-influential-twitter-users-in-africa"
html = urlopen(url)

In [281]:
# converting the html to a BeautifulSoup object
soup_obj = BeautifulSoup(html, 'lxml')

# extracting the headers that contain a Twitter handle
headers_30 = [i.text for i in soup_obj.find_all("h2") if '@' in i.text]
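
The header extraction above can be tried without network access. The sketch below is a standard-library stand-in for the BeautifulSoup call, not the notebook's own method: it collects the text of every `<h2>` element and keeps those containing `@`. The class name `H2Collector` and the sample markup are made up for illustration, and the markup only mimics what the ranking page is assumed to look like.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.headings = []      # finished heading strings
        self._buffer = None     # text pieces while inside an <h2>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._buffer is not None:
            self.headings.append("".join(self._buffer))
            self._buffer = None


# Made-up markup that mimics the assumed layout of the ranking page.
sample_html = """
<h2>About this list</h2>
<h2>100. Jeffrey Gettleman (<a href="#">@gettleman</a>)</h2>
<p>Some description.</p>
<h2>99. Africa24 Media (@a24media)</h2>
"""

collector = H2Collector()
collector.feed(sample_html)

# Same filter as in the notebook: keep only headings that mention a handle.
handle_headings = [h for h in collector.headings if "@" in h]
print(handle_headings)
```

The `'@' in text` filter is what drops section headings that are not part of the ranking.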

In [5]:
#Cleaning the Data

names = []
handle = []
for i in headers_30:
    i = i.split('(')
    i[0] = i[0].split('.')
    handle.append(i[1].replace(')', ''))
    names.append(i[0])
    
names[:10]  #Cleaned name and position data

[['100', ' Jeffrey Gettleman '],
 ['99', ' Africa24 Media '],
 ['98', ' Scapegoat '],
 ['97', ' Africa Check '],
 ['96', ' James Copnall '],
 ['95', ' Online Africa '],
 ['94', ' Patrick Ngowi '],
 ['93', ' DOS African Affairs '],
 ['92', ' MoadowAJE '],
 ['91', ' Brendan Boyle ']]

In [6]:
for i in names:     #removing extra spaces in the name
    i[1] = i[1].strip()

In [7]:
names[:10]

[['100', 'Jeffrey Gettleman'],
 ['99', 'Africa24 Media'],
 ['98', 'Scapegoat'],
 ['97', 'Africa Check'],
 ['96', 'James Copnall'],
 ['95', 'Online Africa'],
 ['94', 'Patrick Ngowi'],
 ['93', 'DOS African Affairs'],
 ['92', 'MoadowAJE'],
 ['91', 'Brendan Boyle']]

In [8]:
handle[:10]   #cleaned influencer twitter handle 

['@gettleman',
 '@a24media',
 '@andiMakinana',
 '@AfricaCheck',
 '@JamesCopnall',
 '@oafrica',
 '@PatrickNgowi',
 '@StateAfrica',
 '@Moadow',
 '@BrendanSAfrica']
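
Splitting on `'.'` and `'('` works for most headers, but a name that itself contains a period is cut at that period, which may be why a `Trailing` column is needed further down. A minimal alternative sketch, assuming every header looks like `"<position>. <name> (@handle)"`, parses all three fields with one regular expression. `HEADER_RE`, `parse_header` and the second sample string are hypothetical and not part of the original notebook.

```python
import re

# position, a period, the name (which may contain periods), then "(@handle)"
HEADER_RE = re.compile(r"^\s*(\d+)\.\s*(.+?)\s*\((@\w+)\)\s*$")


def parse_header(text):
    """Return (position, name, handle), or None if the text does not match."""
    match = HEADER_RE.match(text)
    if match is None:
        return None
    position, name, handle = match.groups()
    return int(position), name, handle


print(parse_header("100. Jeffrey Gettleman (@gettleman)"))
print(parse_header("7. A. B. Example (@example_user)"))   # made-up header
print(parse_header("About this list"))                    # no match
```

Because the position is converted to `int` during parsing, the later `astype(int)` step would not be needed with this approach.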

In [9]:
#presenting as a DataFrame

pd_name = pd.DataFrame(names, columns=['Position', 'Name', 'Trailing'])
del pd_name['Trailing']

mod_pd = pd_name.apply(lambda x: x.astype(int) if x.name == 'Position' else x)
mod_pd["Handle"] = handle      #adding the handle as a column
# pd_0.set_index('Position', inplace=True)

mod_pd.sort_values(by=['Position'], inplace=True)

mod_pd.reset_index(drop=True, inplace=True)

mod_pd.head(10)    #pandas DataFrame of the influencers and their handles.

Unnamed: 0,Position,Name,Handle
0,1,Trevor Noah,@Trevornoah
1,2,Gareth Cliff,@GarethCliff
2,3,Jacob G,@SAPresident
3,4,News24,@News24
4,5,Julius Sello Malema,@Julius_S_Malema
5,6,Helen Zille,@helenzille
6,7,mailandguardian,@mailandguardian
7,8,5FM,@5FM
8,9,loyiso gola,@loyisogola
9,10,Computicket,@Computicket


In [10]:
mod_pd.to_csv('cleaned_influencers.csv')     #saving the cleaned influencer list

``
Above, I scraped the top 100 Twitter users in Africa, cleaned the data, presented it as a DataFrame and finally saved it as a CSV file: cleaned_influencers.csv
``

## Top African Government Official
In this section, I will be scraping [African leaders respond to coronavirus… on Twitter](https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#east-africa) to obtain the accounts of influential African leaders.

In [11]:
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content  #.encode(BeautifulSoup.original_encoding)
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # .get() avoids a KeyError when the server sends no Content-Type header
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and content_type.find('html') > -1)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
    
def get_elements(url, tag='', search=None, fname=None):
    """
    Downloads a page specified by the url parameter
    and returns a list of strings, one per tag element
    """
    
    if isinstance(url,str):
        response = simple_get(url)
    else:
        #if already it is a loaded html page
        response = url

    if response is not None:
        html = BeautifulSoup(response, 'html.parser')
        
        res = []
        if tag:    
            for li in html.select(tag):
                for name in li.text.split('\n'):
                    if len(name) > 0:
                        res.append(name.strip())
                       
                
        if search:
            soup = html            
            
            
            r = ''
            if 'find' in search.keys():
                print('finding', search['find'])
                soup = soup.find(**search['find'])
                r = soup

                
            if 'find_all' in search.keys():
                print('finding all of', search['find_all'])
                r = soup.find_all(**search['find_all'])
   
            if r:
                for x in list(r):
                    if len(x) > 0:
                        res.extend(x)
            
        return res

    # Raise an exception if we failed to get any data from the url
    raise Exception('Error retrieving contents at {}'.format(url))    
    
    
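# Expose get_elements as a command-line tool only when run as a plain script;
# inside a notebook, fire would try to parse the kernel's own arguments.
if __name__ == '__main__' and 'get_ipython' not in globals():
    fire.Fire(get_elements)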
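
The content-type check performed by `is_good_response` can be exercised without making a request. The sketch below is an illustration under assumptions: `looks_like_html` is a hypothetical re-implementation of the same idea, and the `SimpleNamespace` objects are stand-ins that carry only the two attributes the check reads (a real `requests` response has a case-insensitive header mapping, whereas a plain `dict` is used here).

```python
from types import SimpleNamespace


def looks_like_html(resp):
    """True for a 200 response whose Content-Type mentions HTML."""
    content_type = resp.headers.get("Content-Type", "").lower()
    return resp.status_code == 200 and "html" in content_type


# Stand-in responses; none of these touch the network.
html_page = SimpleNamespace(status_code=200,
                            headers={"Content-Type": "text/html; charset=UTF-8"})
json_reply = SimpleNamespace(status_code=200,
                             headers={"Content-Type": "application/json"})
no_header = SimpleNamespace(status_code=200, headers={})
missing_page = SimpleNamespace(status_code=404,
                               headers={"Content-Type": "text/html"})

for resp in (html_page, json_reply, no_header, missing_page):
    print(looks_like_html(resp))
```

Only the first stand-in passes: the others fail on content type, on a missing header, or on status code.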

In [12]:
url= 'https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#east-africa'
response = simple_get(url)

In [278]:
res = get_elements(response, search={'find_all':{'class_':'wp-block-embed__wrapper'}})

In [276]:
# converting res to a string

list_str = str(res)

In [1]:
list_str = list_str.split('<blockquote')  # splitting at each blockquote, since that is where each account can be extracted

In [17]:
import re

series_list = pd.Series(list_str)    # converting the list to a pandas Series
pattern = r"\((@.+)\)"

new_list = series_list.str.extract(pattern)

new_list.head(20)

Unnamed: 0,0
0,
1,@EswatiniGovern1
2,@MalawiGovt
3,@hagegeingob
4,@FinanceSC
5,@PresidencyZA
6,@mohzambia
7,@edmnangagwa
8,@MinSantedj
9,@hawelti


The extracted data contains NaN values, which come from chunks where the pattern found no handle, so they have to be dropped. I will do that in the next cell and then save the result as a CSV file.

In [18]:
new_list.dropna(inplace=True)    #dropping NaN

new_list.columns = ['Handles']    #renaming the column

In [28]:
new_list.head(10)

Unnamed: 0,Handles
1,@EswatiniGovern1
2,@MalawiGovt
3,@hagegeingob
4,@FinanceSC
5,@PresidencyZA
6,@mohzambia
7,@edmnangagwa
8,@MinSantedj
9,@hawelti
10,@StateHouseKenya


In [29]:
len(new_list)   #the number of handles recovered

36

In [30]:
new_list.to_csv('african_leaders.csv')     #saving the series as a csv
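
The extraction steps in this section (split the markup at each `<blockquote`, pull out the handle in parentheses, drop the chunks with no match) can be sketched with the standard library alone. This is a rough illustration rather than the notebook's pandas code: `extract_handles` and the sample markup are made up, and the pattern uses `\w+` instead of the `.+` above, which is a deliberate narrowing to characters allowed in a handle.

```python
import re

HANDLE_RE = re.compile(r"\((@\w+)\)")

# Made-up markup shaped like embedded tweets; the handles are not real accounts.
sample = (
    '<div><blockquote class="twitter-tweet"><p>Stay safe.</p>'
    '&mdash; Example Leader (@example_leader) <a href="#">April 1, 2020</a></blockquote>'
    '<blockquote class="twitter-tweet"><p>No attribution here.</p></blockquote>'
    '<blockquote class="twitter-tweet"><p>Wash your hands.</p>'
    '&mdash; Example Ministry (@example_ministry)</blockquote></div>'
)


def extract_handles(markup):
    """Return the first '(@handle)' found in each blockquote chunk."""
    handles = []
    for chunk in markup.split("<blockquote"):
        match = HANDLE_RE.search(chunk)
        if match:                       # chunks without a handle are skipped
            handles.append(match.group(1))
    return handles


print(extract_handles(sample))
```

Skipping unmatched chunks plays the same role as `dropna` in the pandas version.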

## Analyzing Influencers and Government Officials Tweets

In the next couple of cells, I will use tweepy to extract the tweets posted by the 100 influencers and the top government officials.

In [31]:
#Import the necessary methods from tweepy library  

#install tweepy if you don't have it
#!pip install tweepy
import tweepy
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#sentiment analysis package
#!pip install textblob
from textblob import TextBlob

#general text pre-processor
#!pip install nltk
import nltk
from nltk.corpus import stopwords
nltk.download('punkt')

#tweet pre-processor 
#!pip install tweet-preprocessor
import preprocessor as p

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\OWNER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [32]:
def print_full(x):
  '''
  This is to print nicely DataFrame wide tables
  '''
  pd.set_option('display.max_rows', len(x))
  pd.set_option('display.max_columns', None)
  pd.set_option('display.width', 2000)
  pd.set_option('display.float_format', '{:20,.2f}'.format)
  pd.set_option('display.max_colwidth', None)
  print(x)
  pd.reset_option('display.max_rows')
  pd.reset_option('display.max_columns')
  pd.reset_option('display.width')
  pd.reset_option('display.float_format')
  pd.reset_option('display.max_colwidth')

In [33]:
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""
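
The four empty strings above are placeholders for private credentials. One possible way to avoid pasting them into the notebook is to read them from environment variables, as in the sketch below. The variable names in `ENV_NAMES` and the helper `load_credentials` are assumptions chosen for illustration; the demo uses a fake mapping so that it runs without real keys.

```python
import os

# Hypothetical variable names; any consistent naming scheme would do.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=None):
    """Return the four credentials, raising if any is missing or empty."""
    if environ is None:
        environ = os.environ
    creds = {field: environ.get(var, "") for field, var in ENV_NAMES.items()}
    missing = [ENV_NAMES[field] for field, value in creds.items() if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return creds


# Demo with a fake environment so the sketch runs without real keys.
demo_env = {
    "TWITTER_CONSUMER_KEY": "ck",
    "TWITTER_CONSUMER_SECRET": "cs",
    "TWITTER_ACCESS_TOKEN": "at",
    "TWITTER_ACCESS_TOKEN_SECRET": "ats",
}
creds = load_credentials(demo_env)
print(sorted(creds))
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the literals.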

In [34]:
# Creating the authentication object
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# Setting your access token and secret
auth.set_access_token(access_token, access_token_secret)
# Creating the API object while passing in auth information
# (wait_on_rate_limit_notify is a Tweepy 3.x argument; it was removed in Tweepy 4.0)
api = tweepy.API(auth, wait_on_rate_limit=True,
                     wait_on_rate_limit_notify=True)

In [47]:
#To obtain valid screen name avoiding Tweepy error 404

names = []
for influencer in mod_pd['Handle']:      
    try:
        u=api.get_user(influencer)
        names.append(u.screen_name)
    except Exception:
        pass    # skip handles the API cannot resolve (e.g. error 404)
names[:10]

['Trevornoah',
 'GarethCliff',
 'SAPresident',
 'News24',
 'Julius_S_Malema',
 'helenzille',
 'mailandguardian',
 '5FM',
 'loyisogola',
 'Computicket']

In [48]:
len(names)     #seven invalid usernames

93

In [51]:
#getting follower counts

followers = []
for screen_nm in names:
    u=api.get_user(screen_nm)
    followers.append(u.followers_count)
    
followers[:10]

[10801764,
 1974454,
 18,
 3574787,
 3125323,
 1443227,
 1059929,
 1164162,
 1085138,
 200299]

In [52]:
#getting the number of tweets
no_tweets = []
for screen_nm in names:
    u=api.get_user(screen_nm)
    no_tweets.append(u.statuses_count)
    
no_tweets[:10]

[11185, 31623, 19, 322577, 37190, 72335, 144198, 62810, 5653, 45108]

In [53]:
#no of friends influencer is following

friends = []
for screen_nm in names:
    u=api.get_user(screen_nm)
    friends.append(u.friends_count)
    
friends[:10]

In [54]:
#getting the number of retweets
rt_count = []    
for influencer in names:
    influencers_tweets = api.user_timeline(screen_name =influencer, count=200)
    rt = 0
    for tweet in influencers_tweets:
        rt += tweet.retweet_count
    rt_count.append(rt)

In [69]:
rt_count[:10]

[4609962, 39668, 851, 7710, 143963, 16562, 2979, 779, 560184, 433]

In [70]:
#getting the number of likes
likes_count = []    
for influencer in names:
    influencers_tweets = api.user_timeline(screen_name =influencer, count=200)
    lk = 0
    for tweet in influencers_tweets:
        lk += tweet.favorite_count
    likes_count.append(lk)

In [71]:
likes_count[:10]

[1235122, 13190, 0, 14020, 117026, 15859, 922, 5675, 6642, 134]
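
The cells above call `api.get_user` three times per account and `api.user_timeline` twice. A possible consolidation, sketched below under assumptions, gathers all five numbers in a single pass. `collect_stats` is a hypothetical helper; the two callables it receives stand in for the API, and the demo feeds it made-up objects so that no network access or credentials are needed.

```python
from types import SimpleNamespace


def collect_stats(screen_names, get_user, get_timeline):
    """Build one row of profile and engagement totals per account."""
    rows = []
    for screen_name in screen_names:
        user = get_user(screen_name)            # one profile lookup
        tweets = get_timeline(screen_name)      # one timeline request
        rows.append({
            "handle": screen_name,
            "follower_count": user.followers_count,
            "following_count": user.friends_count,
            "tweet_count": user.statuses_count,
            "retweet_count": sum(t.retweet_count for t in tweets),
            "like_count": sum(t.favorite_count for t in tweets),
        })
    return rows


# Stand-ins for the API: made-up numbers, no network access.
fake_users = {
    "example_user": SimpleNamespace(followers_count=120, friends_count=30,
                                    statuses_count=4),
}
fake_timelines = {
    "example_user": [
        SimpleNamespace(retweet_count=5, favorite_count=9),
        SimpleNamespace(retweet_count=2, favorite_count=1),
    ],
}

rows = collect_stats(["example_user"],
                     fake_users.__getitem__,
                     fake_timelines.__getitem__)
print(rows)
```

With the notebook's objects, the callables would be `api.get_user` and `lambda name: api.user_timeline(screen_name=name, count=200)`, and the list of rows could be handed to `pd.DataFrame` directly.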

In [143]:
#getting the number of comments 


comment_count = []    
for influencer in names:
    influencers_tweets = api.user_timeline(screen_name =influencer, count=200)
    cm = 0
    for tweet in influencers_tweets:
        cm += tweet.reply_count
    comment_count.append(cm)    
    
# NOTE: reply_count is only available to premium users of the Twitter API, so this cell does not run on the standard tier

In [253]:
#collecting mentions

from datetime import datetime, date, time, timedelta
mentions = []
    
for influencer in mod_pd['Handle']:
  try:
    for status in tweepy.Cursor(api.user_timeline, id=influencer).items():
      if hasattr(status, "entities"):
        entities = status.entities
        if "user_mentions" in entities:
          for ent in entities["user_mentions"]:
            if ent is not None:
              if "screen_name" in ent:
                name = ent["screen_name"]
                if name is not None:
                  mentions.append(name)
  except Exception:
    pass    # skip accounts whose timeline cannot be read

In [254]:
mentions[:10]    # screen names (handles) of the mentioned accounts, not their display names

['KingBach',
 'franklinleonard',
 'SawyerHackett',
 'kimlatricejones',
 'DEADLINE',
 'Sensational_Dre',
 'maleeezy_',
 'sarahcpr',
 'michaelharriot',
 'GaryChambersJr']
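
The nested `if` checks in the mention loop all guard against missing or empty entries, and the hashtag loop further down has the same shape. A compact sketch of that pattern is shown below. `entity_values` is a hypothetical helper, and the `SimpleNamespace` objects only imitate the `entities` attribute of a status; nothing here talks to the API.

```python
from types import SimpleNamespace


def entity_values(statuses, kind, key):
    """Yield status.entities[kind][i][key] for every entry that has a value."""
    for status in statuses:
        entities = getattr(status, "entities", None) or {}
        for entry in entities.get(kind) or []:
            value = (entry or {}).get(key)
            if value is not None:
                yield value


# Made-up statuses: one with entities, one with empty lists, one with none.
fake_statuses = [
    SimpleNamespace(entities={
        "user_mentions": [{"screen_name": "example_a"},
                          {"screen_name": "example_b"}],
        "hashtags": [{"text": "ExampleTag"}],
    }),
    SimpleNamespace(entities={"user_mentions": [], "hashtags": []}),
    SimpleNamespace(),
]

demo_mentions = list(entity_values(fake_statuses, "user_mentions", "screen_name"))
demo_hashtags = list(entity_values(fake_statuses, "hashtags", "text"))
print(demo_mentions, demo_hashtags)
```

The same function covers both cases: `("user_mentions", "screen_name")` for mentions and `("hashtags", "text")` for hashtags.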

In [255]:
mention_csv=pd.DataFrame(mentions)
mention_csv.columns=['mention']

mention_csv.to_csv('mention.csv')

mention_csv.head()

Unnamed: 0,mention
0,KingBach
1,franklinleonard
2,SawyerHackett
3,kimlatricejones
4,DEADLINE


In [132]:
hashtags = {}

for influencer in mod_pd['Handle']:  
  hs = []
  try:  
    for status in tweepy.Cursor(api.user_timeline, id=influencer).items():
      if hasattr(status, "entities"):
        entities = status.entities
        if "hashtags" in entities:
          for ent in entities["hashtags"]:
            if ent is not None:
              if "text" in ent:
                hashtag = ent["text"]
                if hashtag is not None:
                  hs.append(hashtag)
    hashtags[influencer] = hs
  except Exception:
    pass    # skip accounts whose timeline cannot be read

In [157]:
hashtags['@Trevornoah'][:10]   # first ten hashtags for Trevornoah

['BlackPeoplePenance',
 'BlackLivesMatter',
 'BlackOutTuesday',
 'JusticeForGeorgeFloyd',
 'CincUp',
 'coronavirus',
 'FallonAtHome',
 'ClubQuarantine',
 'WithMe',
 'AwholeNewSong']

In [158]:
df=pd.DataFrame.from_dict(hashtags, orient='index')
df=df.transpose()
df.head(10)    # the first 10 hashtags for each influencer

Unnamed: 0,@Trevornoah,@GarethCliff,@SAPresident,@News24,@Julius_S_Malema,@helenzille,@mailandguardian,@5FM,@loyisogola,@Computicket,...,@iamsuede,@JamalMOsman,@KahnMorbee,@artsouthafrica,@malonebarry,@InvestInAfrica,@TheEIU_Africa,@sarzss,@VISI_Mag,@BrendanSAfrica
0,BlackPeoplePenance,SoWhatNow,FridayFelling,TheBacheloretteSA,TBTChallenge,GBV,Eskom,MandelaDay2020,CRYMUN,VanPletzen,...,TheNod,Africa,BoxOffice,Art,Ethiopia,stockexchange,Mali,lockdown,PicksOfTheWeek,StopTheBantustanBills
1,BlackLivesMatter,SoWhatNow,FullLoadWorkPresure,StateCaptureInquiry,SAMA26,Women,67minutes,67Minutes,LaLiga,GMABenefitConcert,...,DefundThePolice,Chinese,meatfreemonday,ArtistOfTheMonthASA,QAnon,tokens,SSA,COVID19,PicksOfTheWeek,Right2SayNo
2,BlackOutTuesday,SoWhatNow,business,MapitiMatsena,Loadshedding,VBSArrests,COVID__19,MandelaDay2020,LaLiga,GMABenefitConcert,...,PVALLEYPremiere,Malawi,CollinsKhosa,ArtSouthAfrica,RIP,DLT,Malawi,coronavirus,VISIDecor,Xolobeni
3,JusticeForGeorgeFloyd,SoWhatNow,Self,MapitiMatsena,Eskom,VBS,EndHighSchoolInSawa,67Minutes,ARSLIV,MissSa2020,...,PVALLEYPremiere,Thread,theshowmustbepaused,LoveOfArt,Syria,blockchain,Zimbabwe,Covid_19,VISIArt,CyrilRamaphosa
4,CincUp,CliffCentralApp,Motivation,MapitiMatsena,EFFTurns7,day78oflockdown,Eritrea,ForbesAndFix,Sopranos,FaceYourPower,...,REDTABLETALK,Ethiopia,notoracism,Painting,GrenfellInquiry,equities,Malawi,coronavirus,VISILifestyle,media
5,coronavirus,GCS,confidence,ExtendTheLicence,EFFOnlineStore,ClassicFawlty,Covid19,UnpopularOpinion,SaveLiveComedy,EmbraceYourFuture,...,WillSmith,Egypt,GeorgeFloyd,Inspiration,BBCAfricaEye,bonds,IEA,USAmbassador,VISIDesign,Bosasa
6,FallonAtHome,SoWhatNow,Tesla,LoadShedding,cyrilramaphosa,FawltyTowers,ServiceDelivery,MidMorningsOn5,ARSLEI,MissSATop15,...,JadaSmith,Somalia,StayAtHomeLive,Retweet,SexForGrades,Ethiopia,Coronavirus,BurkinaFaso,PicksOfTheWeek,MTBPS2018
7,ClubQuarantine,OneMillionGraves,,Covid19,alcoholmustfall,CoronaCast,SouthAfrica,NewMusicFriday,ARSLEI,SARIEGESELS,...,AugustAlsina,Mogadishu,GoodMorningsOFM,Creativity,BREAKING,stockexchange,Nigeria,Covid,PicksOfTheWeek,journalists
8,WithMe,SoWhatNow,,LoadShedding,IMMEDIATE,CoronaCast,Covid19,NewMusicFriday,CRYCHE,1stOfAll,...,soaugust,African,capetown,ArtistOfTheMonthASA,DC,Ghana,B2B,Appalled,VISIDecor,media
9,AwholeNewSong,SoWhatNow,,LoadShedding,EFFTurns7,UCT,EducationMatters,BraaiJuly,SOUMCI,1stOfAll,...,redtabletalk,Britain,flipflopday,ArtSouthAfrica,wedding,IPO,intelligence,Meghan,PicksOfTheWeek,SBSJA18


In [141]:
df.to_csv('hashtag.csv') 

### African Leaders 
In the cells above, I collected information about the influencers. In the next couple of cells, I will obtain the same information for the African leaders and then join the two sets of data for analysis.

In [161]:
#To obtain valid screen name avoiding Tweepy error 404

leader_names = []
for leader in new_list['Handles']:      
    try:
        u=api.get_user(leader)
        leader_names.append(u.screen_name)
    except Exception:
        pass    # skip handles the API cannot resolve (e.g. error 404)
leader_names[:10]

['EswatiniGovern1',
 'MalawiGovt',
 'hagegeingob',
 'FinanceSC',
 'PresidencyZA',
 'mohzambia',
 'edmnangagwa',
 'MinSantedj',
 'hawelti',
 'StateHouseKenya']

In [162]:
len(leader_names)     #no invalid usernames

36

In [165]:
#getting follower counts

leaders_followers = []
for screen_nm in leader_names:
    u=api.get_user(screen_nm)
    leaders_followers.append(u.followers_count)
    
leaders_followers[:10]

[11298, 39220, 192306, 126, 1598283, 7145, 545914, 2928, 66042, 1103172]

In [166]:
#getting the number of tweets

leaders_no_tweets = []
for screen_nm in leader_names:
    u=api.get_user(screen_nm)
    leaders_no_tweets.append(u.statuses_count)
    
leaders_no_tweets[:10]

[1673, 4030, 1085, 125, 18854, 837, 628, 1064, 4714, 9044]

In [167]:
#no of friends leader is following

leader_friends = []
for screen_nm in leader_names:
    u=api.get_user(screen_nm)
    leader_friends.append(u.friends_count)
    
leader_friends[:10]

[82, 26, 55, 224, 14, 95, 116, 127, 434, 214]

In [168]:
#getting the number of retweets
leader_rt_count = []    
for leader in leader_names:
    leaders_tweets = api.user_timeline(screen_name =leader, count=200)
    rt = 0
    for tweet in leaders_tweets:
        rt += tweet.retweet_count
    leader_rt_count.append(rt)

In [169]:
leader_rt_count[:10]

[1098, 5078, 10240, 625, 8691, 18566, 54875, 3536, 11106, 24584]

In [170]:
#getting the number of likes

leader_likes_count = []    
for leader in leader_names:
    leaders_tweets = api.user_timeline(screen_name =leader, count=200)
    lk = 0
    for tweet in leaders_tweets:
        lk += tweet.favorite_count
    leader_likes_count.append(lk)
    
leader_likes_count[:10]

[3555, 9321, 68495, 123, 20408, 1319, 175958, 1596, 25839, 99505]

In [172]:
#getting the number of comments 


leader_comment_count = []    
for leader in leader_names:
    leaders_tweets = api.user_timeline(screen_name =leader, count=200)
    cm = 0
    for tweet in leaders_tweets:
        cm += tweet.reply_count
    leader_comment_count.append(cm)    
    
# NOTE: reply_count is only available to premium users of the Twitter API, so this cell does not run on the standard tier

In [264]:
#collecting mentions

from datetime import datetime, date, time, timedelta
leaders_mentions = []
    
for leader in new_list['Handles']:
  try:
    for status in tweepy.Cursor(api.user_timeline, id=leader).items():
      if hasattr(status, "entities"):
        entities = status.entities
        if "user_mentions" in entities:
          for ent in entities["user_mentions"]:
            if ent is not None:
              if "screen_name" in ent:
                name = ent["screen_name"]
                if name is not None:
                  leaders_mentions.append(name)
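            # Leftover from a disabled date filter (status.created_at < end_date):
            # this break leaves the inner loop after the first entry, so at most
            # one mention per tweet is recorded.
            break
  except Exception:
    pass    # skip accounts whose timeline cannot be read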

In [265]:
leaders_mentions[:10]    # screen names (handles) of the mentioned accounts, not their display names

['EUinEswatini',
 'UNFPAEswatini',
 'TW_Eswatini',
 'UEswatini',
 '_AfricaTimes',
 'UNAIDS',
 'UN',
 'UNFPAEswatini',
 'Rolihlahla93',
 'ngu_nonsh']

In [266]:
leaders_mention_csv=pd.DataFrame(leaders_mentions)
leaders_mention_csv.columns=['mention']

leaders_mention_csv.to_csv('leaders_mention.csv')

leaders_mention_csv.head()

Unnamed: 0,mention
0,EUinEswatini
1,UNFPAEswatini
2,TW_Eswatini
3,UEswatini
4,_AfricaTimes


In [186]:
leaders_hashtags = {}

for leader in new_list['Handles']:  
  hs = []
  try:  
    for status in tweepy.Cursor(api.user_timeline, id=leader).items():
      if hasattr(status, "entities"):
        entities = status.entities
        if "hashtags" in entities:
          for ent in entities["hashtags"]:
            if ent is not None:
              if "text" in ent:
                hashtag = ent["text"]
                if hashtag is not None:
                  hs.append(hashtag)
    leaders_hashtags[leader] = hs
  except Exception:
    pass    # skip accounts whose timeline cannot be read

In [189]:
leaders_hashtags['@MalawiGovt'][:10]

['MalawiCabinet',
 'MalawiCabinet',
 'MalawiCabinet',
 'MalawiCabinet',
 'MalawiCabinet',
 'MalawiCabinet',
 'IndependenceDay',
 'Malawi',
 'InaugurationMalawi2020',
 'InaugurationMalawi2020']

In [190]:
df1=pd.DataFrame.from_dict(leaders_hashtags, orient='index')
df1=df1.transpose()
df1.head(10)    # the first 10 hashtags for each leader

Unnamed: 0,@EswatiniGovern1,@MalawiGovt,@hagegeingob,@FinanceSC,@PresidencyZA,@mohzambia,@edmnangagwa,@MinSantedj,@hawelti,@StateHouseKenya,...,@NAkufoAddo,@President_GN,@USEmbalo,@PresidenceMali,@CheikhGhazouani,@IssoufouMhm,@MBuhari,@Macky_Sall,@PresidentBio,@MSPS_Togo
0,COVID19,MalawiCabinet,ZindziMandela,COVID19,NelsonMandelaInternationalDay,COVID19,4thOfJuly,COVID19,Eritrea,Covid19,...,GovtThisWeek,COVID19,Portugal,ORTM,,ZLECAf,DemocracyDay2020,ConseildesMinistresSn,SierraLeone,BigData
1,COVID19,MalawiCabinet,DRC,FA4JR,NelsonMandelaInternationalDay,COVID19,AfricaDay2020,Hackathon,Eritrea,Covid_19,...,4MoreToDoMoreForYou,Guinea,Guiné,Urgent,,13Mai,PTFCOVID19,ConseildesMinistresSn,SierraLeone,esanté
2,COVID19,MalawiCabinet,HappyFathersDay2020,COVID19,NelsonMandelaInternationalDay,COVID19,mothersday2020,France,Eritrea,JukumuniSisi,...,GovtThisWeek,Fêtedutravail,CEDEAO,ORTM,,UAE,COVID19,ConseildesMinistresSn,COVID19,sante
3,COVID19,MalawiCabinet,Periscope,FA4JR,NelsonMandelaInternationalDay,ParentingMonth,COVID19,Djibouti,Eritrea,KomeshaCorona,...,4MoreToDoMoreForYou,Guinée,Guiné,Urgent,,EAU,COVID19,ConseildesMinistresSn,DPGAlliance,Lassa
4,COVID19,MalawiCabinet,NamSONA2020,COVID19,NelsonMandelaInternationalDay,COVID19,COVID19,France,Eritrea,COVIDー19,...,RegisterToVote2020,Ramadan,Sénégal,Mali,,1erMai,COVID19,IGE,SierraLeone,Togo
5,COVID19,MalawiCabinet,NamSONA2020,FA4JR,NelsonMandelaInternationalDay,COVID19,NationalCommandCouncil,Djibouti,Eritrea,JukumuniSisi,...,4MoreToDoMoreForYou,Ramadan,Covid_19,Mali,,Niger,PTFCOVID19,diaspora,BudapestBamako,santé
6,COVID19,IndependenceDay,NamSONA2020,COVID19,Presidentialimbizo,WHA73,FélixTshisekedi,Djibouti,Teleferica,CoronaVirus,...,RegisterToVote2020,Guinée,Covid_19,IBK,,GlobalVaccine,Coronavirus,Covid19sn,COVID19,Team228
7,COVID19,Malawi,AU,FA4JR,PresidentialImbizo,EPICENTER,COVID19,COVID19,Italians,KomeshaCorona,...,4MoreToDoMoreForYou,coronavirus,CoronaPandemie,Urgent,,COVID19,PTFCOVID19,ConseildesMinistresSn,SierraLeone,TgInfo
8,COVID19,InaugurationMalawi2020,AfricaDay,COVID19,PresidentialImbizo,COVID,ZimLockdown,COVID19,Eritrea,KomeshaCorona,...,RegisterToVote2020,Coronavirus,GuineaBissau,Mali,,CEDEAO,thread,ConseildesMinistresSn,coronavirus,Santé
9,COVID19,InaugurationMalawi2020,Agenda2063,FA4JR,ZindziMandela,Zambia,COVID19,COVID19,Eritrea,Coronavirus,...,4MoreToDoMoreForYou,coronavirus,AsoVillaToday,Mali,,G5Sahel,30yearsweddinganniversary,ConseildesMinistresSn,COVID,OMS


In [191]:
df1.to_csv('leaders_hashtag.csv') 

### Calculating popularity_score, reach_score and relevance_scores.
In the cells above, I collected the information for both the government and the influencer handles. In the next couple of cells, I will calculate their popularity_score and reach_score (relevance_score turns out to need data that I could not access).

In [196]:
#joining their screen names as a list

total_name = names + leader_names
len(total_name)   # 93 + 36 (7 invalid influencer handles)

129

In [198]:
#joining their retweet count

total_retweet = rt_count + leader_rt_count
len(total_retweet)

129

In [199]:
#joining their number of likes

total_like = likes_count + leader_likes_count
len(total_like)

129

In [200]:
#joining their follower count

total_follower = followers + leaders_followers
len(total_follower)

129

In [201]:
#joining their following count

total_following = friends + leader_friends
len(total_following)

129

In [267]:
# To count mentions, I tally how often each account's screen name appears in the combined mention list

total_mentions = mentions + leaders_mentions
mention_dict = {}
for name in total_name:
    mn = 0
    for mention in total_mentions:
        if name == mention:
            mn += 1
    mention_dict[name] = mn

len(mention_dict)    

129
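
The nested loop above compares every account name with every mention. The same tally can be sketched with `collections.Counter`, which counts all mentions in one pass. `count_mentions` and the sample lists are hypothetical and only illustrate the idea; like the loop above, the comparison is an exact, case-sensitive match.

```python
from collections import Counter


def count_mentions(account_names, all_mentions):
    """Map each account name to how often it appears in all_mentions."""
    counts = Counter(all_mentions)          # one pass over the mentions
    # A Counter returns 0 for names that never appear.
    return {name: counts[name] for name in account_names}


demo_names = ["example_a", "example_b", "example_c"]
demo_mentions = ["example_a", "someone_else", "example_a", "example_b"]
print(count_mentions(demo_names, demo_mentions))
```

Applied to the notebook's data, the call would be `count_mentions(total_name, total_mentions)`.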

In [257]:
#forming dataframe from the total lists above

total_df = pd.DataFrame(list(zip(total_name, total_retweet, total_like, total_follower, total_following)), 
                        columns=['handle', 'retweet_count', 'like_count', 'follower_count', 'following_count'])

total_df.head()

Unnamed: 0,handle,retweet_count,like_count,follower_count,following_count
0,Trevornoah,4609962,1235122,10801764,325
1,GarethCliff,39668,13190,1974454,356
2,SAPresident,851,0,18,14
3,News24,7710,14020,3574787,632
4,Julius_S_Malema,143963,117026,3125323,652


In [268]:
#forming dataframe from mention dict

mention_df=pd.DataFrame.from_dict(mention_dict, orient='index').reset_index()

mention_df.columns=['handle', 'mention_count']
mention_df.head(5)

Unnamed: 0,handle,mention_count
0,Trevornoah,200
1,GarethCliff,516
2,SAPresident,4
3,News24,68
4,Julius_S_Malema,182


In [269]:
len(mention_df)

129

In [270]:
#joining the mention dataframe and total dataframe on the handle column

final_df = pd.merge(total_df, mention_df, on='handle')
final_df.head()

Unnamed: 0,handle,retweet_count,like_count,follower_count,following_count,mention_count
0,Trevornoah,4609962,1235122,10801764,325,200
1,GarethCliff,39668,13190,1974454,356,516
2,SAPresident,851,0,18,14,4
3,News24,7710,14020,3574787,632,68
4,Julius_S_Malema,143963,117026,3125323,652,182


In [271]:
#calculating popularity_score
final_df['popularity_score'] = final_df['retweet_count'] + final_df['like_count']

#calculating reach_score
final_df['reach_score'] = final_df['follower_count'] - final_df['following_count']

# I cannot calculate relevance_score because I could not access the number of comments: it is only available to premium members
final_df.head()

Unnamed: 0,handle,retweet_count,like_count,follower_count,following_count,mention_count,popularity_score,reach_score
0,Trevornoah,4609962,1235122,10801764,325,200,5845084,10801439
1,GarethCliff,39668,13190,1974454,356,516,52858,1974098
2,SAPresident,851,0,18,14,4,851,4
3,News24,7710,14020,3574787,632,68,21730,3574155
4,Julius_S_Malema,143963,117026,3125323,652,182,260989,3124671
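
The two scores are plain arithmetic on the collected counts, which can be checked by hand against the first row of the table above. The function names below are made up for this sketch; the numbers are the Trevornoah row from the notebook's own output.

```python
def popularity_score(retweet_count, like_count):
    """Retweets plus likes over the sampled tweets."""
    return retweet_count + like_count


def reach_score(follower_count, following_count):
    """Followers minus accounts followed."""
    return follower_count - following_count


# Trevornoah row: retweets, likes, followers, following.
trevor_popularity = popularity_score(4609962, 1235122)
trevor_reach = reach_score(10801764, 325)
print(trevor_popularity)   # 5845084
print(trevor_reach)        # 10801439
```

Both values agree with the `popularity_score` and `reach_score` columns shown for that row.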


In [273]:
final_df.sort_values(by=['popularity_score'], ascending=False).head(10)

Unnamed: 0,handle,retweet_count,like_count,follower_count,following_count,mention_count,popularity_score,reach_score
0,Trevornoah,4609962,1235122,10801764,325,200,5845084,10801439
119,NAkufoAddo,152059,880533,1506364,352,172,1032592,1506012
28,JendayiFrazer,897526,12,29941,480,1,897538,29461
103,PaulKagame,146594,542067,1982615,181,76,688661,1982434
8,loyisogola,560184,6642,1085138,3948,215,566826,1081190
108,KagutaMuseveni,82943,465241,1811633,28,31,548184,1811605
19,hartleyr,503546,388,35218,455,2,503934,34763
47,schneiderhome,477706,12272,26399,1007,0,489978,25392
125,MBuhari,89771,362189,3270290,26,51,451960,3270264
58,africagathering,419307,145,48956,653,0,419452,48303


In [274]:
final_df.sort_values(by=['reach_score'], ascending=False).head(10)

Unnamed: 0,handle,retweet_count,like_count,follower_count,following_count,mention_count,popularity_score,reach_score
0,Trevornoah,4609962,1235122,10801764,325,200,5845084,10801439
3,News24,7710,14020,3574787,632,68,21730,3574155
125,MBuhari,89771,362189,3270290,26,51,451960,3270264
4,Julius_S_Malema,143963,117026,3125323,652,182,260989,3124671
103,PaulKagame,146594,542067,1982615,181,76,688661,1982434
1,GarethCliff,39668,13190,1974454,356,516,52858,1974098
108,KagutaMuseveni,82943,465241,1811633,28,31,548184,1811605
13,euphonik,6344,48170,1753982,65,69,54514,1753917
97,PresidencyZA,8691,20408,1598283,14,50,29099,1598269
119,NAkufoAddo,152059,880533,1506364,352,172,1032592,1506012


In [275]:
final_df.sort_values(by=['mention_count'], ascending=False).head(10)

Unnamed: 0,handle,retweet_count,like_count,follower_count,following_count,mention_count,popularity_score,reach_score
12,UlrichJvV,22535,42897,1042588,530434,1024,65432,512154
1,GarethCliff,39668,13190,1974454,356,516,52858,1974098
6,mailandguardian,2979,922,1059929,479,323,3901,1059450
8,loyisogola,560184,6642,1085138,3948,215,566826,1081190
0,Trevornoah,4609962,1235122,10801764,325,200,5845084,10801439
126,Macky_Sall,29642,199737,1374324,171,184,229379,1374153
4,Julius_S_Malema,143963,117026,3125323,652,182,260989,3124671
10,MTVBaseAfrica,26382,13268,1415550,109,178,39650,1415441
119,NAkufoAddo,152059,880533,1506364,352,172,1032592,1506012
37,MbuyiseniNdlozi,122690,111374,1073810,5474,121,234064,1068336


#### Conclusion
In this section, I calculated popularity_score and reach_score for the leaders and influencers from the data I got from the Twitter API. I also sorted the accounts by these metrics and by mention count, which gave interesting results.
Based on these findings, I made a presentation to a fictional marketing strategy manager at NIKE, lol, which you can access [here](https://docs.google.com/presentation/d/15Derm9ZtOHzhPLM4o1WUD1DZV8culghr8busVNmliLg/edit?usp=sharing)

<br /> Cheers!

## Analysis Based on Hashtags



``
The next part of this project is to extract five unique hashtags for each influencer and leader and group the accounts based on them. I will continue from here at a later date.
``
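
As a starting point for that step, one possible reading of "five unique hashtags" is the five most frequent distinct hashtags per account. The sketch below follows that reading only; it is an assumption, not the planned method. `top_hashtags` is a hypothetical helper, and the sample is the list of ten `@MalawiGovt` hashtags printed earlier in the notebook.

```python
from collections import Counter


def top_hashtags(tags, n=5):
    """Return the n most frequent hashtags, ignoring case and empty entries."""
    counts = Counter(tag.lower() for tag in tags if tag)
    return [tag for tag, _ in counts.most_common(n)]


# The ten @MalawiGovt hashtags shown earlier, plus a None as padding.
malawi_sample = (
    ["MalawiCabinet"] * 6
    + ["IndependenceDay", "Malawi",
       "InaugurationMalawi2020", "InaugurationMalawi2020", None]
)

print(top_hashtags(malawi_sample, n=2))
```

Empty entries are filtered out because the transposed hashtag DataFrame pads shorter columns with missing values; lower-casing merges variants such as `Presidentialimbizo` and `PresidentialImbizo`.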

Cheers!
<br /> Lawal Ogunfowora <br />
Lawal1998@yahoo.com