<a href="https://colab.research.google.com/github/mikael-daniels/team11_pipeline/blob/master/Transforming.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building an ETL Pipeline

As the second part of the predict for Gather, you will need to build a pipeline of Python functions that does the following:

1. A function that connects to Twitter and scrapes "Eskom_SA" tweets.
<br>
<br>
2. A function that cleans and processes the scraped tweets, creating a dataframe with two new columns using the following function: <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a) Hashtag Remover from Analyse Functions
<br>
<br>
3. A function that connects to your SQL database and uploads the tweets into the table where you store them.
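
For step 3, a minimal sketch of the upload might look like the following. It uses Python's built-in `sqlite3` as a stand-in for the real database (the commented-out `pyodbc` import suggests a different server, which is an assumption here), and the function name `upload_tweets`, the table name `tweets`, and the sample rows are all hypothetical.

```python
import sqlite3

def upload_tweets(rows, connection, table="tweets"):
    """Insert (tweet, date) pairs into `table` and return how many were sent.

    The table and column names are hypothetical; match them to your own schema.
    """
    # Table names cannot be bound as query parameters, so accept only a plain identifier
    if not table.isidentifier():
        raise ValueError(f"Unsafe table name: {table!r}")
    with connection:  # commits on success, rolls back on error
        connection.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (Tweets TEXT, Date TEXT)"
        )
        connection.executemany(
            f"INSERT INTO {table} (Tweets, Date) VALUES (?, ?)", rows
        )
    return len(rows)

# Made-up sample rows standing in for the scraped dataframe
conn = sqlite3.connect(":memory:")
sample_rows = [
    ("sample tweet one #poweralert", "2020-03-06 05:54:58"),
    ("sample tweet two", "2020-02-26 07:27:41"),
]
inserted = upload_tweets(sample_rows, conn)
stored = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(inserted, stored)
```

With a dataframe, the rows could be produced with `list(df[['Tweets', 'Date']].astype(str).itertuples(index=False, name=None))`. The column names in the `INSERT` statement should match those of the dataframe and of the database table.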

In [0]:
# General:
import tweepy           # To consume Twitter's API
import pandas as pd     # To handle data
import numpy as np      # For numerical computation
import json
# For plotting and visualization:
from IPython.display import display
#import pyodbc


# Consumer and Access details

Fill in the Consumer and Access details you should have received when applying for access to the Twitter API.

In [0]:
# Consumer:
CONSUMER_KEY    = '**************************'
CONSUMER_SECRET = '**************************************'

# Access:
ACCESS_TOKEN  = '*********************************************'
ACCESS_SECRET = '*********************************************'
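
As an alternative to pasting the keys into the notebook, they could be read from environment variables so they never end up in version control. The sketch below is one possible approach, not part of the original pipeline; the helper name `load_twitter_credentials` is hypothetical, and the environment variable names are assumed to mirror the constants above.

```python
import os

def load_twitter_credentials(env=os.environ):
    """Read the four Twitter credentials from a mapping (os.environ by default)."""
    names = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
    # Collect every missing or empty name so the error reports all of them at once
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing credentials: {', '.join(missing)}")
    return {name: env[name] for name in names}

# Demonstration with a stand-in mapping; real code would rely on os.environ
fake_env = {
    "CONSUMER_KEY": "ck",
    "CONSUMER_SECRET": "cs",
    "ACCESS_TOKEN": "at",
    "ACCESS_SECRET": "as",
}
creds = load_twitter_credentials(fake_env)
print(sorted(creds))
```

The returned dictionary can then be unpacked into the four constants used in the rest of the notebook.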

In [0]:
# API setup:
def twitter_setup():
    """
    Utility function to set up the Twitter API
    with access and consumer keys from Twitter.
    """

    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    # Return API with authentication:
    api = tweepy.API(auth, timeout=1000)
    return api

# Function 1:

Write a function which:
- Scrapes _"Eskom_SA"_ tweets from Twitter. 

Function Specifications:
- The function should return a dataframe containing only the "_Tweets_" and "_Date_" columns of the scraped tweets.
- The function takes in the `consumer key`, `consumer secret code`, `access token` and `access secret code`.

NOTE:
The dataframe should have the same column names as those in your SQL Database table where you store the tweets.

In [0]:
def twitter_df(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET):

    ''' Return a new dataframe with data scraped from Twitter

        Args: 'CONSUMER_KEY', 'CONSUMER_SECRET', 'ACCESS_TOKEN', 'ACCESS_SECRET'

        Returns: Dataframe with two columns: 'Tweets' and 'Date'
    '''
    # Authenticate with the keys passed in, so the arguments are actually used
    # (twitter_setup() reads the module-level constants instead)
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    extractor = tweepy.API(auth, timeout=1000)

    # Request the 100 most recent timeline entries; retweets are filtered out
    # afterwards, so fewer than 100 tweets may be returned
    tweets = extractor.user_timeline(screen_name="Eskom_SA",
                                     count=100,
                                     include_rts=False)
    print(f"Number of tweets extracted: {len(tweets)}.\n")

    # Build the dataframe column by column, keeping 'Date' as datetime values
    data = pd.DataFrame({'Tweets': [tweet.text for tweet in tweets],
                         'Date': [tweet.created_at for tweet in tweets]})

    return data

# Function 2: Extracting hashtags and municipalities

Write a function which:
- Uses the function you wrote in the Analyse section to extract the hashtags and municipalities into their own columns in a new dataframe.

Function Specifications:
- The function should take in the pandas dataframe you created in Function 1 and return a new pandas dataframe. 

In [0]:
twitter_df(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET )

Number of tweets extracted: 96.



| | Tweets | Date |
|---|---|---|
| 0 | Meter tampering is when you cause the meter to... | 2020-03-06 08:30:00 |
| 1 | Interfering with or vandalising Eskom property... | 2020-03-06 06:34:00 |
| 2 | #POWERALERT 1\n\nDate: 06 March 2020\n\nNo loa... | 2020-03-06 05:54:58 |
| 3 | #Eskom #MediaStatement\n\nEskom notes allegati... | 2020-03-05 19:28:53 |
| 4 | DYK that you can report #poweroutages on MyEsk... | 2020-03-05 16:00:00 |
| ... | ... | ... |
| 91 | Only Eskom trained and authorized personnel ar... | 2020-02-26 08:20:00 |
| 92 | #ESKOMFREESTATE #MEDIASTATEMENT\n\nNOTICE TO S... | 2020-02-26 07:35:41 |
| 93 | @Bonolo_maphutha Thanks. We will follow-up. | 2020-02-26 07:27:41 |
| 94 | @tboymunyai @GautengProvince @GautengANC @EFFS... | 2020-02-26 07:26:16 |


In [0]:
def extract_municipality_hashtags(df):

    ''' Return a dataframe with hashtags and municipalities extracted, using dataframe obtained from Function 1

        Args: 'df', Dataframe created by Function 1

        Returns: Copy of 'df' with two new columns, 'municipality' and 'hashtags'
    '''
    mun_dict = {'@CityofCTAlerts' : 'Cape Town',
                '@CityPowerJhb' : 'Johannesburg',
                '@eThekwiniM' : 'eThekwini' ,
                '@EMMInfo' : 'Ekurhuleni',
                '@centlecutility' : 'Mangaung',
                '@NMBmunicipality' : 'Nelson Mandela Bay',
                '@CityTshwane' : 'Tshwane'}

    # Work on a copy so the dataframe passed in is left unchanged
    df = df.copy()

    # Create 'municipality' column; tweets that mention no known handle stay NaN
    df['municipality'] = pd.Series(np.nan, index=df.index, dtype=object)
    for idx, tweet in df['Tweets'].items():
        words = tweet.split()  # split on any whitespace, including newlines
        for key, municipality in mun_dict.items():
            if key in words:
                df.loc[idx, 'municipality'] = municipality

    # Create 'hashtags' column: Mikael
    words_per_tweet = df['Tweets'].str.lower().str.split()

    # Extract hashtags from Tweets: Monica
    df['hashtags'] = [[word for word in words if word.startswith('#')]
                      for words in words_per_tweet]
    # TODO: fill empty values in 'hashtags' column with np.nan: Courtney

    return df
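
To see what the two extraction steps do to a single tweet, here is a small standalone sketch that needs no Twitter access. The helper names `extract_hashtags` and `find_municipality` and the sample text are made up for illustration. Unlike the function above, `find_municipality` returns the first matching handle rather than the last.

```python
def extract_hashtags(tweet):
    """Return the lower-cased hashtags in a tweet, in order of appearance."""
    return [word for word in tweet.lower().split() if word.startswith('#')]

def find_municipality(tweet, handles):
    """Return the municipality for the first known handle in the tweet, else None."""
    words = tweet.split()  # whitespace split, so newlines also separate words
    for handle, municipality in handles.items():
        if handle in words:
            return municipality
    return None

# Two of the handles from mun_dict, plus a made-up sample tweet
handles = {'@CityPowerJhb': 'Johannesburg', '@CityTshwane': 'Tshwane'}
sample = "#Eskom #MediaStatement\n\nOutage reported to @CityPowerJhb"

print(extract_hashtags(sample))            # ['#eskom', '#mediastatement']
print(find_municipality(sample, handles))  # Johannesburg
```

One limitation of splitting on whitespace is that punctuation stays attached: `extract_hashtags("Report #poweroutages, please")` returns `['#poweroutages,']`, and a handle followed by a comma would not be matched.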


In [0]:
extract_municipality_hashtags(twitter_df(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET))

Number of tweets extracted: 96.



| | Tweets | Date | municipality | hashtags |
|---|---|---|---|---|
| 0 | Meter tampering is when you cause the meter to... | 2020-03-06 08:30:00 | | [] |
| 1 | Interfering with or vandalising Eskom property... | 2020-03-06 06:34:00 | | [] |
| 2 | #POWERALERT 1\n\nDate: 06 March 2020\n\nNo loa... | 2020-03-06 05:54:58 | | [#poweralert] |
| 3 | #Eskom #MediaStatement\n\nEskom notes allegati... | 2020-03-05 19:28:53 | | [#eskom, #mediastatement] |
| 4 | DYK that you can report #poweroutages on MyEsk... | 2020-03-05 16:00:00 | | [#poweroutages] |
| ... | ... | ... | ... | ... |
| 91 | Only Eskom trained and authorized personnel ar... | 2020-02-26 08:20:00 | | [] |
| 92 | #ESKOMFREESTATE #MEDIASTATEMENT\n\nNOTICE TO S... | 2020-02-26 07:35:41 | | [#eskomfreestate, #mediastatement] |
| 93 | @Bonolo_maphutha Thanks. We will follow-up. | 2020-02-26 07:27:41 | | [] |
| 94 | @tboymunyai @GautengProvince @GautengANC @EFFS... | 2020-02-26 07:26:16 | | [] |
