# Functions That Will Be Used To Analyse Data

Functions reduce code duplication and let the user get an output for varying inputs. The functions below use Eskom data and variables.

These functions are:

- Metric Dictionary
- Five Number Summary
- Date Parser
- Municipality & Hashtag Detector
- Number of Tweets per Day
- Word Splitter
- Stop Words

Authors: Nthabeleng Vilakazi, Refiloe Phipha, Neliswe Mabanga, Jaganeth Chetty and Sevha Vukeya

## Imports

In [1]:
import pandas as pd
import numpy as np

## Data Loading and Preprocessing

### Electrification by province (EBP) data

In [2]:
ebp_url = 'https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/electrification_by_province.csv'
ebp_df = pd.read_csv(ebp_url)

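# strip the thousands separators and convert every province column to int
# (DataFrame.iteritems() was removed in pandas 2.0, so iterate over the column names)
for col in ebp_df.columns[1:]:
    ebp_df[col] = ebp_df[col].str.replace(',', '').astype(int)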

ebp_df.head()

Unnamed: 0,Financial Year (1 April - 30 March),Limpopo,Mpumalanga,North west,Free State,Kwazulu Natal,Eastern Cape,Western Cape,Northern Cape,Gauteng
0,2000/1,51860,28365,48429,21293,63413,49008,48429,6168,39660
1,2001/2,68121,26303,38685,20928,64123,45773,38685,10359,36024
2,2002/3,49881,11976,28532,10316,63078,55748,28532,6869,32127
3,2003/4,42034,33515,34027,16135,60282,47414,34027,10976,39488
4,2004/5,54646,16218,21450,5668,37811,42041,21450,6316,18422
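
As an alternative to stripping the commas column by column, `pd.read_csv` can parse thousands separators itself through its `thousands` parameter. The sketch below uses a small inline CSV rather than the real file, and it assumes the numeric fields are quoted strings such as `"51,860"`. The two rows are copied from the preview above.

```python
import io
import pandas as pd

# Inline stand-in for the EBP file; the quoting of the numeric fields is an assumption.
csv_text = (
    'Financial Year,Limpopo,Gauteng\n'
    '2000/1,"51,860","39,660"\n'
    '2001/2,"68,121","36,024"\n'
)

# thousands=',' tells the parser to treat commas inside numbers as separators.
sample_df = pd.read_csv(io.StringIO(csv_text), thousands=',')
print(sample_df.dtypes)
```

The `Financial Year` column stays a string column because values such as `2000/1` are not numeric.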


### Twitter data

In [4]:
twitter_url = 'https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/twitter_nov_2019.csv'
twitter_df = pd.read_csv(twitter_url)
twitter_df.head()

Unnamed: 0,Tweets,Date
0,@BongaDlulane Please send an email to mediades...,2019-11-29 12:50:54
1,@saucy_mamiie Pls log a call on 0860037566,2019-11-29 12:46:53
2,@BongaDlulane Query escalated to media desk.,2019-11-29 12:46:10
3,"Before leaving the office this afternoon, head...",2019-11-29 12:33:36
4,#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...,2019-11-29 12:17:43


## Important Variables

In [5]:
# gauteng ebp data as a list
gauteng = ebp_df['Gauteng'].astype(float).to_list()

# dates for twitter tweets
dates = twitter_df['Date'].to_list()

# dictionary mapping official municipality twitter handles to the municipality name
mun_dict = {
    '@CityofCTAlerts' : 'Cape Town',
    '@CityPowerJhb' : 'Johannesburg',
    '@eThekwiniM' : 'eThekwini' ,
    '@EMMInfo' : 'Ekurhuleni',
    '@centlecutility' : 'Mangaung',
    '@NMBmunicipality' : 'Nelson Mandela Bay',
    '@CityTshwane' : 'Tshwane'
}

# dictionary of English stop words
stop_words_dict = {
    'stopwords':[
        'where', 'done', 'if', 'before', 'll', 'very', 'keep', 'something', 'nothing', 'thereupon', 
        'may', 'why', '’s', 'therefore', 'you', 'with', 'towards', 'make', 'really', 'few', 'former', 
        'during', 'mine', 'do', 'would', 'of', 'off', 'six', 'yourself', 'becoming', 'through', 
        'seeming', 'hence', 'us', 'anywhere', 'regarding', 'whole', 'down', 'seem', 'whereas', 'to', 
        'their', 'various', 'thereafter', '‘d', 'above', 'put', 'sometime', 'moreover', 'whoever', 'although', 
        'at', 'four', 'each', 'among', 'whatever', 'any', 'anyhow', 'herein', 'become', 'last', 'between', 'still', 
        'was', 'almost', 'twelve', 'used', 'who', 'go', 'not', 'enough', 'well', '’ve', 'might', 'see', 'whose', 
        'everywhere', 'yourselves', 'across', 'myself', 'further', 'did', 'then', 'is', 'except', 'up', 'take', 
        'became', 'however', 'many', 'thence', 'onto', '‘m', 'my', 'own', 'must', 'wherein', 'elsewhere', 'behind', 
        'becomes', 'alone', 'due', 'being', 'neither', 'a', 'over', 'beside', 'fifteen', 'meanwhile', 'upon', 'next', 
        'forty', 'what', 'less', 'and', 'please', 'toward', 'about', 'below', 'hereafter', 'whether', 'yet', 'nor', 
        'against', 'whereupon', 'top', 'first', 'three', 'show', 'per', 'five', 'two', 'ourselves', 'whenever', 
        'get', 'thereby', 'noone', 'had', 'now', 'everyone', 'everything', 'nowhere', 'ca', 'though', 'least', 
        'so', 'both', 'otherwise', 'whereby', 'unless', 'somewhere', 'give', 'formerly', '’d', 'under', 
        'while', 'empty', 'doing', 'besides', 'thus', 'this', 'anyone', 'its', 'after', 'bottom', 'call', 
        'n’t', 'name', 'even', 'eleven', 'by', 'from', 'when', 'or', 'anyway', 'how', 'the', 'all', 
        'much', 'another', 'since', 'hundred', 'serious', '‘ve', 'ever', 'out', 'full', 'themselves', 
        'been', 'in', "'d", 'wherever', 'part', 'someone', 'therein', 'can', 'seemed', 'hereby', 'others', 
        "'s", "'re", 'most', 'one', "n't", 'into', 'some', 'will', 'these', 'twenty', 'here', 'as', 'nobody', 
        'also', 'along', 'than', 'anything', 'he', 'there', 'does', 'we', '’ll', 'latterly', 'are', 'ten', 
        'hers', 'should', 'they', '‘s', 'either', 'am', 'be', 'perhaps', '’re', 'only', 'namely', 'sixty', 
        'made', "'m", 'always', 'those', 'have', 'again', 'her', 'once', 'ours', 'herself', 'else', 'has', 'nine', 
        'more', 'sometimes', 'your', 'yours', 'that', 'around', 'his', 'indeed', 'mostly', 'cannot', '‘ll', 'too', 
        'seems', '’m', 'himself', 'latter', 'whither', 'amount', 'other', 'nevertheless', 'whom', 'for', 'somehow', 
        'beforehand', 'just', 'an', 'beyond', 'amongst', 'none', "'ve", 'say', 'via', 'but', 'often', 're', 'our', 
        'because', 'rather', 'using', 'without', 'throughout', 'on', 'she', 'never', 'eight', 'no', 'hereupon', 
        'them', 'whereafter', 'quite', 'which', 'move', 'thru', 'until', 'afterwards', 'fifty', 'i', 'itself', 'n‘t',
        'him', 'could', 'front', 'within', '‘re', 'back', 'such', 'already', 'several', 'side', 'whence', 'me', 
        'same', 'were', 'it', 'every', 'third', 'together'
    ]
}

## Function 1: Metric Dictionary

This function calculates the mean, median, variance, standard deviation, minimum and maximum of a list of items. You can assume the given list contains only numerical entries.

**Function Specifications:**
- Function should allow a list as input.
- It should return a `dict` with keys `'mean'`, `'median'`, `'std'`, `'var'`, `'min'`, and `'max'`, corresponding to the mean, median, standard deviation, variance, minimum and maximum of the input list, respectively.
- The standard deviation and variance values must be unbiased. 

In [6]:
def dictionary_of_metrics(items):
    # mean
    total = 0
    for i in items:
        total += i
    mean = total / len(items)

    # sort a copy so the caller's list is not reordered
    ordered = sorted(items)
    minimum = ordered[0]
    maximum = ordered[-1]

    # median: middle value for an odd count, mean of the two middle values for an even count
    mid = len(ordered) // 2
    if len(ordered) % 2 == 0:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    else:
        median = ordered[mid]

    # unbiased (sample) variance: divide by n - 1
    squared_dev = 0
    for i in ordered:
        squared_dev += (i - mean) ** 2
    var = squared_dev / (len(items) - 1)

    # standard deviation
    std = var ** 0.5

    # keys follow the specification above
    return {'mean': mean,
            'median': median,
            'var': var,
            'std': std,
            'min': minimum,
            'max': maximum}

dictionary_of_metrics(gauteng)

{'mean': 26244.416666666668,
 'median': 24403.5,
 'var': 108160153.1742424,
 'std': 10400.007364143663,
 'min': 8842.0,
 'max': 39660.0}
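
The hand-written calculations can be cross-checked against NumPy. In this sketch, `sample` is a made-up list (not Eskom data), and `ddof=1` is what makes `np.var` and `np.std` divide by n - 1, which gives the unbiased estimates the specification asks for.

```python
import numpy as np

sample = [2.0, 4.0, 4.0, 5.0, 10.0]  # hypothetical values, not Eskom data

metrics = {
    'mean': float(np.mean(sample)),
    'median': float(np.median(sample)),
    'var': float(np.var(sample, ddof=1)),  # ddof=1 -> divide by n - 1
    'std': float(np.std(sample, ddof=1)),
    'min': float(np.min(sample)),
    'max': float(np.max(sample)),
}
print(metrics)
```

For this list the squared deviations from the mean of 5 sum to 36, so the unbiased variance is 36 / 4 = 9 and the standard deviation is 3.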

## Function 2: Five Number Summary

This function takes a list of integers and returns a dictionary of the [five number summary](https://www.statisticshowto.datasciencecentral.com/how-to-find-a-five-number-summary-in-statistics/).

**Function Specifications:**
- The function should take a list as input.
- The function should return a `dict` with keys `'max'`, `'median'`, `'min'`, `'q1'`, and `'q3'` corresponding to the maximum, median, minimum, first quartile and third quartile, respectively. 

In [7]:
### START FUNCTION
def five_num_summary(items):
    # your code here
    return {'max': np.max(items), 'median': np.median(items), 'min': np.min(items), 'q1': np.percentile(items, 25), 'q3': np.percentile(items, 75)}
### END FUNCTION

In [8]:
five_num_summary(gauteng)

{'max': 39660.0,
 'median': 24403.5,
 'min': 8842.0,
 'q1': 18653.0,
 'q3': 36372.0}
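
`np.percentile` interpolates linearly between neighbouring values by default, so a quartile does not have to be one of the entries in the list. A small illustration on a made-up list:

```python
import numpy as np

values = [1, 2, 3, 4]  # hypothetical data

# With n = 4 values, the 25th percentile sits at position 0.25 * (n - 1) = 0.75,
# i.e. three quarters of the way from 1 to 2.
q1 = np.percentile(values, 25)
# The 75th percentile sits at position 0.75 * (n - 1) = 2.25, a quarter of the way from 3 to 4.
q3 = np.percentile(values, 75)
print(q1, q3)  # 1.75 3.25
```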

## Function 3: Date Parser

The `dates` variable (created at the top of this notebook) is a list of dates represented as strings. Each string contains the date in `'yyyy-mm-dd'` format, followed by the time in `hh:mm:ss` format. The first three entries in this variable are:
```python
dates[:3] == [
    '2019-11-29 12:50:54',
    '2019-11-29 12:46:53',
    '2019-11-29 12:46:10'
]
```

The function below takes as input a list of these datetime strings and returns only the date in `'yyyy-mm-dd'` format.

**Function Specifications:**
- The function should take a list of strings as input.
- Each string in the input list is formatted as `'yyyy-mm-dd hh:mm:ss'`.
- The function should return a list of strings where each element in the returned list contains only the date in the `'yyyy-mm-dd'` format.

In [9]:
### START FUNCTION
def date_parser(dates):
    # your code here
    dates_only =[]
    for i in dates:
        date = i[0:10]
        dates_only.append(date)   
    return dates_only
### END FUNCTION

In [10]:
date_parser(dates[:3])

['2019-11-29', '2019-11-29', '2019-11-29']
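
Slicing the first ten characters works because every string has the same fixed layout. Where the input might be malformed, parsing with the standard library's `datetime.strptime` is a stricter option, since it raises `ValueError` on strings that do not match the format. A sketch, in which `date_parser_strict` is a hypothetical name:

```python
from datetime import datetime

def date_parser_strict(dates):
    """Return 'yyyy-mm-dd' strings, raising ValueError on malformed input."""
    fmt = '%Y-%m-%d %H:%M:%S'
    return [datetime.strptime(d, fmt).date().isoformat() for d in dates]

print(date_parser_strict(['2019-11-29 12:50:54', '2019-11-20 10:07:41']))
# ['2019-11-29', '2019-11-20']
```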

## Function 4: Municipality & Hashtag Detector

This function takes in a pandas dataframe and returns a modified dataframe that includes two new columns that contain information about the municipality and hashtag of the tweet.

**Function Specifications:**
* Function should take a pandas `dataframe` as input.
* Extract the municipality from a tweet using the `mun_dict` dictionary defined above, and insert the result into a new column named `'municipality'` in the same dataframe.
* Use the entry `np.nan` when a municipality is not found.
* Extract a list of hashtags from a tweet into a new column named `'hashtags'` in the same dataframe.

In [11]:
### START FUNCTION
def extract_municipality_hashtags(df):
    df1 = df.copy()
    # the tweets are lowercased before matching, so the handles must be lowercased too
    handles = {handle.lower(): name for handle, name in mun_dict.items()}
    cities = []
    hashtags = []

    for tweet in df1['Tweets']:
        city = np.nan  # default when no municipality handle is found
        tags = []      # hashtags found in this tweet
        for word in tweet.lower().split():
            if word in handles:   # word is an official municipality handle
                city = handles[word]
            if '#' in word:       # word contains a hashtag
                tags.append(word)
        cities.append(city)
        hashtags.append(tags if tags else np.nan)  # np.nan when there are no hashtags

    df1['municipality'] = cities
    df1['hashtags'] = hashtags
    return df1
### END FUNCTION

In [12]:
extract_municipality_hashtags(twitter_df.copy())

Unnamed: 0,Tweets,Date,municipality,hashtags
0,@BongaDlulane Please send an email to mediades...,2019-11-29 12:50:54,,
1,@saucy_mamiie Pls log a call on 0860037566,2019-11-29 12:46:53,,
2,@BongaDlulane Query escalated to media desk.,2019-11-29 12:46:10,,
3,"Before leaving the office this afternoon, head...",2019-11-29 12:33:36,,
4,#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...,2019-11-29 12:17:43,,"[#eskomfreestate, #mediastatement]"
...,...,...,...,...
195,Eskom's Visitors Centres’ facilities include i...,2019-11-20 10:29:07,,
196,#Eskom connected 400 houses and in the process...,2019-11-20 10:25:20,,"[#eskom, #eskom, #poweringyourworld]"
197,@ArthurGodbeer Is the power restored as yet?,2019-11-20 10:07:59,,
198,@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...,2019-11-20 10:07:41,,
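
Twitter handles are not case-sensitive, so a lookup that lowercases both the tweet's words and the dictionary keys avoids missing a handle written in a different case. Below is a minimal sketch on made-up tweets. `handles` is a two-entry subset of `mun_dict`, and `find_municipality` and `find_hashtags` are hypothetical helper names. As a simplification, it returns the first handle found and only treats words that start with `#` as hashtags.

```python
import numpy as np

handles = {'@CityPowerJhb': 'Johannesburg', '@CityTshwane': 'Tshwane'}
lookup = {handle.lower(): name for handle, name in handles.items()}

def find_municipality(tweet):
    """Return the first municipality whose handle appears in the tweet, else np.nan."""
    for word in tweet.lower().split():
        if word in lookup:
            return lookup[word]
    return np.nan

def find_hashtags(tweet):
    """Return the lowercased words that start with '#', or np.nan if there are none."""
    tags = [word for word in tweet.lower().split() if word.startswith('#')]
    return tags if tags else np.nan

print(find_municipality('Thanks @CityPowerJHB for the update'))  # Johannesburg
print(find_hashtags('#Eskom restores power #PoweringYourWorld'))
```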


## Function 5: Number of Tweets per Day

This function calculates the number of tweets that were posted per day. 

**Function Specifications:**
- It should take a pandas dataframe as input.

In [14]:
### START FUNCTION
def number_of_tweets_per_day(df):
    df1 = df.copy()  # work on a copy of the dataframe
    # keep only the 'yyyy-mm-dd' part of each datetime string
    dates_only = [date[0:10] for date in df1['Date']]

    data = pd.DataFrame()
    data['Date'] = dates_only
    data['Tweets'] = 1  # one row per tweet, so summing the 1s counts the tweets
    data = data.groupby(['Date']).sum()  # total number of tweets per date
    return data
### END FUNCTION

In [15]:
number_of_tweets_per_day(twitter_df.copy())

Date,Tweets
2019-11-20,18
2019-11-21,11
2019-11-22,25
2019-11-23,19
2019-11-24,14
2019-11-25,20
2019-11-26,32
2019-11-27,13
2019-11-28,32
2019-11-29,16
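
The same count can be produced with pandas string methods and a grouped `count`, without building the helper column of ones. This is a sketch on a three-row made-up dataframe (`sample` and `per_day` are illustrative names), not a replacement for the function above.

```python
import pandas as pd

sample = pd.DataFrame({
    'Tweets': ['a', 'b', 'c'],
    'Date': ['2019-11-29 12:50:54', '2019-11-29 12:46:53', '2019-11-28 09:00:00'],
})

per_day = (
    sample.assign(Date=sample['Date'].str[:10])  # keep only 'yyyy-mm-dd'
          .groupby('Date')[['Tweets']]
          .count()                               # number of tweets per date
)
print(per_day)
```

The result is indexed by `Date` and has a single `Tweets` column, the same shape as the output shown above.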


## Function 6: Word Splitter

This function splits the sentences in a dataframe's column into a list of the separate words. The created lists should be placed in a column named `'Split Tweets'` in the original dataframe. This is also known as [tokenization](https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/).

**Function Specifications:**
- It should take a pandas dataframe as an input.
- The dataframe should contain a column, named `'Tweets'`.
- The function should split the sentences in the `'Tweets'` column into a list of separate words, and place the result into a new column named `'Split Tweets'`. The resulting words must all be lowercase!

In [17]:
### START FUNCTION
def word_splitter(df):
    # your code here
    df1 = df.copy() # Make Copy Of Dataframe
    Tweets = list(df1['Tweets']) # Place Tweets in List
    Tweets_Split = [] # Empty List for appending lists of words for each tweet
    for Tweet in Tweets: # Loop to go through every Tweet in list
        Tweets_Split.append(Tweet.lower().split()) # Append the split words to empty list created above
    df1['Split Tweets'] = Tweets_Split # Insert list of lists where sublists contain splitwords into dataframe
    return df1
### END FUNCTION

In [18]:
word_splitter(twitter_df.copy())

Unnamed: 0,Tweets,Date,Split Tweets
0,@BongaDlulane Please send an email to mediades...,2019-11-29 12:50:54,"[@bongadlulane, please, send, an, email, to, m..."
1,@saucy_mamiie Pls log a call on 0860037566,2019-11-29 12:46:53,"[@saucy_mamiie, pls, log, a, call, on, 0860037..."
2,@BongaDlulane Query escalated to media desk.,2019-11-29 12:46:10,"[@bongadlulane, query, escalated, to, media, d..."
3,"Before leaving the office this afternoon, head...",2019-11-29 12:33:36,"[before, leaving, the, office, this, afternoon..."
4,#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...,2019-11-29 12:17:43,"[#eskomfreestate, #mediastatement, :, eskom, s..."
...,...,...,...
195,Eskom's Visitors Centres’ facilities include i...,2019-11-20 10:29:07,"[eskom's, visitors, centres’, facilities, incl..."
196,#Eskom connected 400 houses and in the process...,2019-11-20 10:25:20,"[#eskom, connected, 400, houses, and, in, the,..."
197,@ArthurGodbeer Is the power restored as yet?,2019-11-20 10:07:59,"[@arthurgodbeer, is, the, power, restored, as,..."
198,@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...,2019-11-20 10:07:41,"[@muthambipaulina, @sabcnewsonline, @iol, @enc..."
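
The loop can also be written with the vectorised `.str` accessor on a pandas Series. A sketch on a made-up two-row dataframe (`sample` is an illustrative name):

```python
import pandas as pd

sample = pd.DataFrame({
    'Tweets': ['Is the power restored as yet?', '#Eskom connected 400 houses'],
})

# .str.lower() lowercases each tweet; .str.split() splits on whitespace
sample['Split Tweets'] = sample['Tweets'].str.lower().str.split()
print(sample['Split Tweets'].tolist())
```

Splitting on whitespace keeps punctuation attached to words, so `yet?` stays a single token, just as in the output above.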


## Function 7: Stop Words

This function removes English stop words from a tweet.

**Function Specifications:**
- It should take a pandas dataframe as input.
- The function should modify the input dataframe.
- The function should return the modified dataframe.

In [20]:
### START FUNCTION
def stop_words_remover(df):
    stop_words = set(stop_words_dict['stopwords'])  # a set gives fast membership tests
    # lowercase and split each tweet, then keep only the words that are not stop words
    df['Without Stop Words'] = [
        [word for word in tweet.lower().split() if word not in stop_words]
        for tweet in df['Tweets']
    ]
    return df
### END FUNCTION

In [21]:
stop_words_remover(twitter_df.copy())

Unnamed: 0,Tweets,Date,Without Stop Words
0,@BongaDlulane Please send an email to mediades...,2019-11-29 12:50:54,"[@bongadlulane, send, email, mediadesk@eskom.c..."
1,@saucy_mamiie Pls log a call on 0860037566,2019-11-29 12:46:53,"[@saucy_mamiie, pls, log, 0860037566]"
2,@BongaDlulane Query escalated to media desk.,2019-11-29 12:46:10,"[@bongadlulane, query, escalated, media, desk.]"
3,"Before leaving the office this afternoon, head...",2019-11-29 12:33:36,"[leaving, office, afternoon,, heading, weekend..."
4,#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...,2019-11-29 12:17:43,"[#eskomfreestate, #mediastatement, :, eskom, s..."
...,...,...,...
195,Eskom's Visitors Centres’ facilities include i...,2019-11-20 10:29:07,"[eskom's, visitors, centres’, facilities, incl..."
196,#Eskom connected 400 houses and in the process...,2019-11-20 10:25:20,"[#eskom, connected, 400, houses, process, conn..."
197,@ArthurGodbeer Is the power restored as yet?,2019-11-20 10:07:59,"[@arthurgodbeer, power, restored, yet?]"
198,@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...,2019-11-20 10:07:41,"[@muthambipaulina, @sabcnewsonline, @iol, @enc..."
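
One pitfall when filtering a list of words is removing items from the list while looping over it, because every removal shifts the remaining items and the next word is never checked. The sketch below contrasts that with building a new list. It uses a tiny hypothetical stop-word set rather than `stop_words_dict`.

```python
stop_words = {'is', 'the', 'as'}  # tiny hypothetical stop-word set
tweet = 'is the power restored as yet?'

# Pitfall: removing during iteration shifts the remaining items left,
# so the word right after each removed one is skipped.
words = tweet.split()
for word in words:
    if word in stop_words:
        words.remove(word)
print(words)  # ['the', 'power', 'restored', 'yet?']

# Building a new list checks every word exactly once.
filtered = [word for word in tweet.split() if word not in stop_words]
print(filtered)  # ['power', 'restored', 'yet?']
```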
