# Analyse - Predict

Functions reduce code duplication and let the user obtain an output for varying inputs. The functions you will write all use Eskom data and variables.

## Instructions to Students
- **Do not add or remove cells in this notebook. Do not edit or remove the `### START FUNCTION` or `### END FUNCTION` comments. Do not add any code outside of the functions you are required to edit. Doing any of this will lead to a mark of 0%!**
- Answer the questions according to the specifications provided.
- Use the given cell in each question to see if your function matches the expected outputs.
- Do not hard-code answers to the questions.
- The use of Stack Overflow, Google, and other online tools is permitted. However, copying a fellow student's code is not permissible and is considered a breach of the Honour Code. Doing this will result in a mark of 0%.
- Good luck, and may the force be with you!

## Imports

In [1]:
import pandas as pd
import numpy as np

## Data Loading and Preprocessing

### Electrification by province (EBP) data

In [2]:
ebp_url = 'https://raw.githubusercontent.com/Explore-AI/academy-content/master/courses/data-science/fundamentals/analyse/predict/data/electrification_by_province.csv?token=AONJI6P3AI6HFYIHHVX2FDK6IKGFU'
ebp_df = pd.read_csv(ebp_url)

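# Strip thousands separators and cast every column except the first to int
# (DataFrame.iteritems() was removed in pandas 2.0, so iterate over the column names instead)
for col in ebp_df.columns[1:]:
    ebp_df[col] = ebp_df[col].str.replace(',','').astype(int)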

ebp_df.head()

HTTPError: HTTP Error 404: Not Found
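
Because the EBP URL returns a 404 in this run, the effect of the cleaning loop is not visible above. The following is a minimal sketch of the same cleaning step on a made-up dataframe. The name `toy_ebp`, the `'Financial Year'` column, and all values are assumptions for illustration only and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for ebp_df: the first column holds labels, the other
# columns hold counts stored as strings with thousands separators (made-up values)
toy_ebp = pd.DataFrame({
    'Financial Year': ['Year A', 'Year B'],
    'Gauteng': ['1,234', '56,789'],
    'Limpopo': ['10', '2,000'],
})

# Strip the commas and cast every column except the first to int
for col in toy_ebp.columns[1:]:
    toy_ebp[col] = toy_ebp[col].str.replace(',', '').astype(int)

print(toy_ebp['Gauteng'].to_list())  # [1234, 56789]
```

Once the columns are numeric, a call such as `ebp_df['Gauteng'].astype(float).to_list()` produces a plain list of floats.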

### Twitter data

In [13]:
twitter_url = 'https://raw.githubusercontent.com/Explore-AI/academy-content/master/courses/data-science/fundamentals/analyse/predict/data/twitter_nov_2019.csv?token=AONJI6ONBTLJJLGK34JLTXK6IKGNI'
twitter_df = pd.read_csv(twitter_url)
twitter_df.head()

Unnamed: 0,Tweets,Date
0,@BongaDlulane Please send an email to mediades...,2019-11-29 12:50:54
1,@saucy_mamiie Pls log a call on 0860037566,2019-11-29 12:46:53
2,@BongaDlulane Query escalated to media desk.,2019-11-29 12:46:10
3,"Before leaving the office this afternoon, head...",2019-11-29 12:33:36
4,#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...,2019-11-29 12:17:43


## Important Variables (Do not edit these!)

In [173]:
# gauteng ebp data as a list
gauteng = ebp_df['Gauteng'].astype(float).to_list()

# dates for twitter tweets
dates = twitter_df['Date'].to_list()

# dictionary mapping official municipality twitter handles to the municipality name
mun_dict = {
    '@CityofCTAlerts' : 'Cape Town',
    '@CityPowerJhb' : 'Johannesburg',
    '@eThekwiniM' : 'eThekwini' ,
    '@EMMInfo' : 'Ekurhuleni',
    '@centlecutility' : 'Mangaung',
    '@NMBmunicipality' : 'Nelson Mandela Bay',
    '@CityTshwane' : 'Tshwane'
}

# dictionary of English stop words
stop_words_dict = {
    'stopwords':[
        'where', 'done', 'if', 'before', 'll', 'very', 'keep', 'something', 'nothing', 'thereupon', 
        'may', 'why', '’s', 'therefore', 'you', 'with', 'towards', 'make', 'really', 'few', 'former', 
        'during', 'mine', 'do', 'would', 'of', 'off', 'six', 'yourself', 'becoming', 'through', 
        'seeming', 'hence', 'us', 'anywhere', 'regarding', 'whole', 'down', 'seem', 'whereas', 'to', 
        'their', 'various', 'thereafter', '‘d', 'above', 'put', 'sometime', 'moreover', 'whoever', 'although', 
        'at', 'four', 'each', 'among', 'whatever', 'any', 'anyhow', 'herein', 'become', 'last', 'between', 'still', 
        'was', 'almost', 'twelve', 'used', 'who', 'go', 'not', 'enough', 'well', '’ve', 'might', 'see', 'whose', 
        'everywhere', 'yourselves', 'across', 'myself', 'further', 'did', 'then', 'is', 'except', 'up', 'take', 
        'became', 'however', 'many', 'thence', 'onto', '‘m', 'my', 'own', 'must', 'wherein', 'elsewhere', 'behind', 
        'becomes', 'alone', 'due', 'being', 'neither', 'a', 'over', 'beside', 'fifteen', 'meanwhile', 'upon', 'next', 
        'forty', 'what', 'less', 'and', 'please', 'toward', 'about', 'below', 'hereafter', 'whether', 'yet', 'nor', 
        'against', 'whereupon', 'top', 'first', 'three', 'show', 'per', 'five', 'two', 'ourselves', 'whenever', 
        'get', 'thereby', 'noone', 'had', 'now', 'everyone', 'everything', 'nowhere', 'ca', 'though', 'least', 
        'so', 'both', 'otherwise', 'whereby', 'unless', 'somewhere', 'give', 'formerly', '’d', 'under', 
        'while', 'empty', 'doing', 'besides', 'thus', 'this', 'anyone', 'its', 'after', 'bottom', 'call', 
        'n’t', 'name', 'even', 'eleven', 'by', 'from', 'when', 'or', 'anyway', 'how', 'the', 'all', 
        'much', 'another', 'since', 'hundred', 'serious', '‘ve', 'ever', 'out', 'full', 'themselves', 
        'been', 'in', "'d", 'wherever', 'part', 'someone', 'therein', 'can', 'seemed', 'hereby', 'others', 
        "'s", "'re", 'most', 'one', "n't", 'into', 'some', 'will', 'these', 'twenty', 'here', 'as', 'nobody', 
        'also', 'along', 'than', 'anything', 'he', 'there', 'does', 'we', '’ll', 'latterly', 'are', 'ten', 
        'hers', 'should', 'they', '‘s', 'either', 'am', 'be', 'perhaps', '’re', 'only', 'namely', 'sixty', 
        'made', "'m", 'always', 'those', 'have', 'again', 'her', 'once', 'ours', 'herself', 'else', 'has', 'nine', 
        'more', 'sometimes', 'your', 'yours', 'that', 'around', 'his', 'indeed', 'mostly', 'cannot', '‘ll', 'too', 
        'seems', '’m', 'himself', 'latter', 'whither', 'amount', 'other', 'nevertheless', 'whom', 'for', 'somehow', 
        'beforehand', 'just', 'an', 'beyond', 'amongst', 'none', "'ve", 'say', 'via', 'but', 'often', 're', 'our', 
        'because', 'rather', 'using', 'without', 'throughout', 'on', 'she', 'never', 'eight', 'no', 'hereupon', 
        'them', 'whereafter', 'quite', 'which', 'move', 'thru', 'until', 'afterwards', 'fifty', 'i', 'itself', 'n‘t',
        'him', 'could', 'front', 'within', '‘re', 'back', 'such', 'already', 'several', 'side', 'whence', 'me', 
        'same', 'were', 'it', 'every', 'third', 'together'
    ]
}

# Function 6: Word Splitter

Write a function which splits the sentences in a dataframe's column into a list of the separate words. The created lists should be placed in a column named `'Split Tweets'` in the original dataframe. This is also known as [tokenization](https://www.geeksforgeeks.org/nlp-how-tokenizing-text-sentence-words-works/).

**Function Specifications:**
- It should take a pandas dataframe as an input.
- The dataframe should contain a column, named `'Tweets'`.
- The function should split the sentences in the `'Tweets'` column into lists of separate words, and place the result into a new column named `'Split Tweets'`. The resulting words must all be lowercase!
- The function should modify the input dataframe directly.
- The function should return the modified dataframe.
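
As a rough illustration of the tokenisation these specifications describe, the sketch below lowercases and splits two tweets copied from the `twitter_df` preview. The names `tweets` and `split_tweets` are placeholders for illustration. It works on plain strings rather than a dataframe, so it shows only the splitting step and is not a full solution.

```python
# Two sample tweets copied from the preview of twitter_df
tweets = [
    '@ArthurGodbeer Is the power restored as yet?',
    '@BongaDlulane Query escalated to media desk.',
]

# Lowercase first, then split on whitespace
split_tweets = [tweet.lower().split() for tweet in tweets]

print(split_tweets[0])
# ['@arthurgodbeer', 'is', 'the', 'power', 'restored', 'as', 'yet?']
```

Note that splitting on whitespace leaves punctuation attached to the neighbouring word (`'yet?'`, `'desk.'`), which matches the expected output for this function.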

In [171]:
### START FUNCTION
def word_splitter(df):
    # Lowercase each tweet, then split it on whitespace into a list of words
    df['Split Tweets'] = df['Tweets'].str.lower().str.split()

    return df

### END FUNCTION

In [172]:
word_splitter(twitter_df.copy())

Unnamed: 0,Tweets,Date,Split Tweets
0,@BongaDlulane Please send an email to mediades...,2019-11-29 12:50:54,"[@bongadlulane, please, send, an, email, to, m..."
1,@saucy_mamiie Pls log a call on 0860037566,2019-11-29 12:46:53,"[@saucy_mamiie, pls, log, a, call, on, 0860037..."
2,@BongaDlulane Query escalated to media desk.,2019-11-29 12:46:10,"[@bongadlulane, query, escalated, to, media, d..."
3,"Before leaving the office this afternoon, head...",2019-11-29 12:33:36,"[before, leaving, the, office, this, afternoon..."
4,#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...,2019-11-29 12:17:43,"[#eskomfreestate, #mediastatement, :, eskom, s..."
...,...,...,...
195,Eskom's Visitors Centres’ facilities include i...,2019-11-20 10:29:07,"[eskom's, visitors, centres’, facilities, incl..."
196,#Eskom connected 400 houses and in the process...,2019-11-20 10:25:20,"[#eskom, connected, 400, houses, and, in, the,..."
197,@ArthurGodbeer Is the power restored as yet?,2019-11-20 10:07:59,"[@arthurgodbeer, is, the, power, restored, as,..."
198,@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...,2019-11-20 10:07:41,"[@muthambipaulina, @sabcnewsonline, @iol, @enc..."


In [16]:
word_splitter(twitter_df.copy())

_**Expected Output**_:

```python

word_splitter(twitter_df.copy()) 

```

> <table class="dataframe" border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Tweets</th>
      <th>Date</th>
      <th>Split Tweets</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>@BongaDlulane Please send an email to mediades...</td>
      <td>2019-11-29</td>
      <td>[@bongadlulane, please, send, an, email, to, m...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>@saucy_mamiie Pls log a call on 0860037566</td>
      <td>2019-11-29</td>
      <td>[@saucy_mamiie, pls, log, a, call, on, 0860037...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>@BongaDlulane Query escalated to media desk.</td>
      <td>2019-11-29</td>
      <td>[@bongadlulane, query, escalated, to, media, d...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Before leaving the office this afternoon, head...</td>
      <td>2019-11-29</td>
      <td>[before, leaving, the, office, this, afternoon...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...</td>
      <td>2019-11-29</td>
      <td>[#eskomfreestate, #mediastatement, :, eskom, s...</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>195</th>
      <td>Eskom's Visitors Centres’ facilities include i...</td>
      <td>2019-11-20</td>
      <td>[eskom's, visitors, centres’, facilities, incl...</td>
    </tr>
    <tr>
      <th>196</th>
      <td>#Eskom connected 400 houses and in the process...</td>
      <td>2019-11-20</td>
      <td>[#eskom, connected, 400, houses, and, in, the,...</td>
    </tr>
    <tr>
      <th>197</th>
      <td>@ArthurGodbeer Is the power restored as yet?</td>
      <td>2019-11-20</td>
      <td>[@arthurgodbeer, is, the, power, restored, as,...</td>
    </tr>
    <tr>
      <th>198</th>
      <td>@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...</td>
      <td>2019-11-20</td>
      <td>[@muthambipaulina, @sabcnewsonline, @iol, @enc...</td>
    </tr>
    <tr>
      <th>199</th>
      <td>RT @GP_DHS: The @GautengProvince made a commit...</td>
      <td>2019-11-20</td>
      <td>[rt, @gp_dhs:, the, @gautengprovince, made, a,...</td>
    </tr>
  </tbody>
</table>

# Function 7: Stop Words

Write a function which removes English stop words from a tweet.

**Function Specifications:**
- It should take a pandas dataframe as input.
- It should tokenise the sentences according to the definition in Function 6. Note that Function 6 **cannot be called within this function**.
- It should remove all stop words from the tokenised list. The stop words are defined in the `stop_words_dict` variable defined at the top of this notebook.
- The resulting tokenised list should be placed in a column named `"Without Stop Words"`.
- The function should modify the input dataframe.
- The function should return the modified dataframe.
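
To make the filtering step concrete, here is a small sketch on a single tweet from the `twitter_df` preview. The set `sample_stop_words` is a hand-picked subset of the words in `stop_words_dict`, chosen only for illustration, and the variable names are placeholders. It shows the idea on a plain string rather than a dataframe.

```python
# Small subset of the stop words listed in stop_words_dict (illustration only)
sample_stop_words = {'is', 'the', 'as', 'a', 'on', 'call', 'to'}

tweet = '@ArthurGodbeer Is the power restored as yet?'

# Tokenise as in Function 6: lowercase, then split on whitespace
tokens = tweet.lower().split()

# Keep only the tokens that are not stop words
without_stop_words = [word for word in tokens if word not in sample_stop_words]

print(without_stop_words)
# ['@arthurgodbeer', 'power', 'restored', 'yet?']
```

Using a set makes each membership test fast. Because punctuation stays attached after splitting, a token such as `'yet?'` is kept even though `'yet'` is a stop word, which is consistent with the expected output for this function.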


In [535]:
### START FUNCTION
def stop_words_remover(df):
    # Collect the stop words into a set for fast membership tests
    stop_words = set(stop_words_dict['stopwords'])

    def remove_stop_words(tweet):
        # Tokenise as in Function 6 (lowercase, split on whitespace), then drop stop words
        return [word for word in tweet.lower().split() if word not in stop_words]

    df['Without Stop Words'] = df['Tweets'].apply(remove_stop_words)

    return df

### END FUNCTION

In [18]:
stop_words_remover(twitter_df.copy())

_**Expected Output**_:

```python
stop_words_remover(twitter_df.copy())
```

> <table class="dataframe" border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Tweets</th>
      <th>Date</th>
      <th>Without Stop Words</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>@BongaDlulane Please send an email to mediades...</td>
      <td>2019-11-29</td>
      <td>[@bongadlulane, send, email, mediadesk@eskom.c...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>@saucy_mamiie Pls log a call on 0860037566</td>
      <td>2019-11-29</td>
      <td>[@saucy_mamiie, pls, log, 0860037566]</td>
    </tr>
    <tr>
      <th>2</th>
      <td>@BongaDlulane Query escalated to media desk.</td>
      <td>2019-11-29</td>
      <td>[@bongadlulane, query, escalated, media, desk.]</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Before leaving the office this afternoon, head...</td>
      <td>2019-11-29</td>
      <td>[leaving, office, afternoon,, heading, weekend...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>#ESKOMFREESTATE #MEDIASTATEMENT : ESKOM SUSPEN...</td>
      <td>2019-11-29</td>
      <td>[#eskomfreestate, #mediastatement, :, eskom, s...</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>195</th>
      <td>Eskom's Visitors Centres’ facilities include i...</td>
      <td>2019-11-20</td>
      <td>[eskom's, visitors, centres’, facilities, incl...</td>
    </tr>
    <tr>
      <th>196</th>
      <td>#Eskom connected 400 houses and in the process...</td>
      <td>2019-11-20</td>
      <td>[#eskom, connected, 400, houses, process, conn...</td>
    </tr>
    <tr>
      <th>197</th>
      <td>@ArthurGodbeer Is the power restored as yet?</td>
      <td>2019-11-20</td>
      <td>[@arthurgodbeer, power, restored, yet?]</td>
    </tr>
    <tr>
      <th>198</th>
      <td>@MuthambiPaulina @SABCNewsOnline @IOL @eNCA @e...</td>
      <td>2019-11-20</td>
      <td>[@muthambipaulina, @sabcnewsonline, @iol, @enc...</td>
    </tr>
    <tr>
      <th>199</th>
      <td>RT @GP_DHS: The @GautengProvince made a commit...</td>
      <td>2019-11-20</td>
      <td>[rt, @gp_dhs:, @gautengprovince, commitment, e...</td>
    </tr>
  </tbody>
</table>