<a href="https://colab.research.google.com/github/mikael-daniels/team11_pipeline/blob/Loading/Loading.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building an ETL Pipeline

As the second part of the predict for Gather, you will need to build a pipeline of functions in python which does the following:

1. Function to connect to twitter and scrapes "Eskom_SA" tweets.
<br>
<br>
2. Cleans/Processes the tweets from the scraped tweets which will create a dataframe with two new columns using the following functions: <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; a) Hashtag Remover from Analyse Functions
<br>
<br>
3. Functions which connects to your SQL database and uploads the tweets into the table you store the tweets in the database.

In [0]:
# General:
import tweepy           # To consume Twitter's API
import pandas as pd     # To handle data
import numpy as np      # For numerical computation
import json
# For plotting and visualization:
from IPython.display import display
import pyodbc


# Consumer and Access details

Fill in your Consumer and Access details you should have recieved when applying for a Twitter API. 

In [0]:
# Consumer:
CONSUMER_KEY    = 'PDU7fauXIq4hZRdkJ62s3OYPT'
CONSUMER_SECRET = 'D118AztFQ9y6irrMOKLKxuU6XLKiYznIZ5d93B4ZnO5SaAzHsB'

# Access:
ACCESS_TOKEN  = '2954682544-Ttwu770muNrhg4KQSY6N9RYGuK33Xs2V80q8Ms8'
ACCESS_SECRET = '95idH7IMmvwO74eOwnDGVasxGZ4E8ftEoGEYPhfGWOmKh'

In [0]:
# API's setup:
def twitter_setup():
    """
    Utility function to setup the Twitter's API
    with access and consumer keys from Twitter.
    """

    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    # Return API with authentication:
    api = tweepy.API(auth, timeout=1000)
    return api

# Function 1:

Write a function which:
- Scrapes _"Eskom_SA"_ tweets from Twitter. 

Function Specifications:
- The function should return a dataframe with the scraped tweets with just the "_Tweets_" and "_Date_". 
- Will take in the ```consumer key,  consumer secret code, access token``` and ```access secret code```.

NOTE:
The dataframe should have the same column names as those in your SQL Database table where you store the tweets.

In [0]:
def twitter_df(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET ):
    
    ''' Return a new dataframe with data scraped from Twitter
        
        Args: 'CONSUMER_KEY', 'CONSUMER_SECRET', 'ACCESS_TOKEN', 'ACCESS_SECRET'

        Returns: Dataframe with two columns: 'Tweets' and 'Date'
    ''' 
    # Code Here
    extractor = twitter_setup()
    
    # We create a tweet list as follows:
    tweets = extractor.user_timeline(screen_name="Eskom_SA", 
                                     count=100,
                                     include_rts=False)
    print(f"Number of tweets extracted: {len(tweets)}.\n")
    
    # We create a pandas dataframe as follows:
    data = pd.DataFrame(data=np.column_stack([[tweet.text for tweet in tweets],
                                              [(tweet.created_at) for tweet in tweets]]),
                                              columns=['Tweets','Date'])
    # We display the first 10 elements of the dataframe:
    #final_df = data.head(10)
    
    return data

# Function 2: Removing hashtags and the municipalities

Write a function which:
- Uses the function you wrote in the Analyse section to extract the hashtags and municipalities into it's own column in a new data frame. 

Function Specifications:
- The function should take in the pandas dataframe you created in Function 1 and return a new pandas dataframe. 

In [0]:
twitter_df(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET )

Number of tweets extracted: 93.



Unnamed: 0,Tweets,Date
0,"@TSIKALIRA Hi, pls DM us your contact details ...",2020-03-10 06:32:50
1,#POWERALERT 1\nDate: 10 March 2020\n\nStage 2 ...,2020-03-10 05:37:17
2,Protect your electrical appliances by unpluggi...,2020-03-10 04:30:00
3,@Jeandre_S @CityTshwane @SABCNewsOnline @SABCR...,2020-03-09 17:32:41
4,@Antifreez777za @SikonathiM Please see our tim...,2020-03-09 17:26:55
...,...,...
88,"Before you leave the house, switch off or unpl...",2020-03-02 04:10:00
89,#POWERALERT 1\nDate: 01 March 2020\n\nNo loads...,2020-03-01 17:04:06
90,Report electricity theft anonymously to Eskom ...,2020-03-01 16:30:00
91,#PublicSafety: Make sure that you and your fam...,2020-03-01 13:06:00


In [0]:
def extract_municipality_hashtags(df):
    
    ''' Return a dataframe with hashtags and municipalities extracted, using dataframe obtained from Function 1

        Args: 'df', Dataframe created by Function 1

        Returns: Original dataframe, 'df', with two new columns, 'hashtags' and 'municipality'
    '''
    mun_dict = {'@CityofCTAlerts' : 'Cape Town',
                '@CityPowerJhb' : 'Johannesburg',
                '@eThekwiniM' : 'eThekwini' ,
                '@EMMInfo' : 'Ekurhuleni',
                '@centlecutility' : 'Mangaung',
                '@NMBmunicipality' : 'Nelson Mandela Bay',
                '@CityTshwane' : 'Tshwane'}

 
    l = 0
    for tweet in df['Tweets']:
      tweet = tweet.split(' ')
      for key in mun_dict.keys():
        if key in tweet:
          df.loc[l, 'municipality'] = mun_dict[key]       
      l += 1

    # Create 'hashtags' column: Mikael
    df['hashtags'] = df['Tweets'].str.lower().str.split()

    # Extract hashtags from Tweets: Monica
    hashtagslist = []
    i = 0
    for tweet in df['hashtags']:
      hashtags = []
      for word in tweet:
        if word.startswith('#'):
          hashtags.append(word)
      hashtagslist.append(hashtags)
      i += 1
    
    df['hashtags'] = hashtagslist
    # Fill empty values in 'hashtags' columns with np.nan: Courtney

    return df

# Function 3: Updating SQL Database with pyODBC

Write a function which:
- Connects and updates your SQL database. 

Function Specifications:
- The function should take in a pandas dataframe created in Function 2. 
- Connect to your SQL database.
- Update the table you store your tweets in.
- Not return any output.

In [0]:
def pyodbc_twitter(connection, df, twitter_table):

    ''' Update 'Tweets' table in our SQL database

      Args: 'connection', 'df', 'twitter_table'

      Returns: None, only updates the 'Twitter' table
    '''

    ### Code Here
    #loading: Mikael
    df = pd.read_sql_query('select * from dbo.all_stations',connection)
    #updating tables in SQL: Monica

    return df.head()

In [0]:
connection = pyodbc.connect(driver='{SQL Server}',
                      host='EDSA-3GKPSM2\SQLEXPRESSMON',
                      database='gather_eskom_cleaned_columns',
                      trusted_connection='tcon',
                      user='sa')

In [0]:
df = twitter_df(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET )

Number of tweets extracted: 93.



In [0]:
twitter_table = extract_municipality_hashtags(df)

In [0]:
pyodbc_twitter(connection, df, twitter_table)

Unnamed: 0,station_name,total_installed_capacity,total_nominal_capacity,location,station_type
0,Acacia,171,171,Cape Town,Gas/liquid fuel turbine stations
1,Ankerlig,1338,1327,Atlantis,Gas/liquid fuel turbine stations
2,Arnot,2352,2232,Middelburg,Coal
3,Camden',1561,1481,Ermelo,Coal
4,Colley Wobbles,42,0,Mbashe River,Hydroelectric stations
