# Week 7 - Data Science

## Drill

__Exercise:__ Write a function `hasvowel()` which accepts a word as input and returns `True` if the word contains a vowel and `False` otherwise. Think about what the argument must satisfy, and raise an error if those requirements are not met. The string method `str.isalpha()` might be helpful. 

In [None]:
# Solution
'''
If the argument is not a word (i.e. it is not a string, or it contains spaces or
non-letter characters), then raise ValueError. 
'''
def hasvowel(word): 
    if not isinstance(word, str): 
        raise ValueError('word must be a string')
    # isalpha() is False for spaces, digits, punctuation and the empty string
    if not word.isalpha(): 
        raise ValueError('word must contain letters only')
    
    # Test each vowel in turn; lower() makes the check case-insensitive
    return any(vowel in word.lower() for vowel in 'aeiou')

In [None]:
# Your code below
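
A common pitfall when testing for several characters at once is to write `('a' or 'e' or 'i' or 'o' or 'u') in word`. Python evaluates the `or` chain in the brackets first, and that chain returns its first truthy operand, `'a'`, so only that single letter is ever tested. The sketch below contrasts this with `any()`. The helper names `buggy_check` and `vowel_check` are made up for illustration.

```python
def buggy_check(word):
    # ('a' or 'e' or ...) evaluates to 'a', so this only tests for 'a'
    return ('a' or 'e' or 'i' or 'o' or 'u') in word


def vowel_check(word):
    # Test each vowel separately; any() is True if at least one test passes
    return any(vowel in word.lower() for vowel in 'aeiou')


print('a' or 'e' or 'i' or 'o' or 'u')   # the chain collapses to one letter
print(buggy_check('tree'))               # 'tree' has no 'a'
print(vowel_check('tree'))               # 'tree' contains 'e'
```

Note also that a function which reaches its end without a `return` statement gives back `None`, not `False`, so it is safer to return the Boolean explicitly.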


__Exercise:__ Print the result below, rounded to __2 decimal places__, using the `.format()` method. The computation is given to you. 

In [7]:
# Solution
from random import randint
result = randint(0,122)/100

print('{0:.2f}'.format(result))

1.04


In [None]:
from random import randint
result = randint(0,122)/100

# Your code below
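
As a reminder of how the format specification works, here is a small sketch with made-up numbers. The part after the colon, `.2f`, asks for fixed-point notation with 2 digits after the decimal point, and the number before the colon selects which argument of `.format()` to use.

```python
value = 1.0449

# '.2f' means fixed-point notation with 2 digits after the decimal point
two_places = '{:.2f}'.format(value)

# The numbers before the colons select the first and second argument
by_position = '{0:.2f} and {1:.1f}'.format(value, 2.0)

# Fields can also be filled by name
by_name = '{mark:.2f}'.format(mark=value)

print(two_places)
print(by_position)
print(by_name)
```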


__Exercise:__ Create a pandas dataframe with the following columns. 
* `id`: Integers of 8 digits, starting from `10000000`.
* `name`: String characters.

After this, run your code and show the first 5 records of the dataframe you have created. 

In [7]:
# Solution
import pandas as pd
df = pd.DataFrame({'id':[10000000,10000001,10000002,10000003,10000004],'name':['Shelly','Kent','Martin','John','Myra']})
print(df.head(5))

         id    name
0  10000000  Shelly
1  10000001    Kent
2  10000002  Martin
3  10000003    John
4  10000004    Myra


In [None]:
# Your code below


## Data Science Process

In data analytics, there is a standard procedure that data scientists follow when carrying out a data science project. It is called __CRISP-DM__ (Cross-Industry Standard Process for Data Mining). It runs from generating ideas, through finding the ingredients for a data science product, to reporting the results. The flowchart below shows the process. 

In [9]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url="fig/CRISPDM_Process_Diagram.png", width=620)

# Image from https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome

## Setting Your Goal

In the first phase, the organisation works out the project objectives and the business requirements, and often summarises them in documents. This is the __business understanding__ phase. It is not constrained to business: any organisation, or __you__, can have a goal, and that goal is what guides you down the road of data analytics. 

__Exercise:__ Think about your career. What is your objective? How could data help you achieve this objective? 

## Data Understanding

A data science project starts with the relevant data sources. Finding and examining them is the __data understanding__ phase, which involves the following steps: 
* Sourcing data
* Exploratory data analysis (EDA)

### Sourcing Data from Twitter

Twitter is a popular social media site where users can express themselves within a limited text length. It attracts a lot of attention from researchers who want to understand how opinions form on Twitter. To help people study what is inside the social network, Twitter offers an API that allows us to obtain information from its servers. 

An API is the gateway to their servers, and we need keys to get access. Twitter uses the __OAuth__ protocol for this: one pair of keys identifies our application, and a second pair (the access token) identifies the account on whose behalf the application acts. These keys work like passwords rather than like the public keys of public-key cryptography, so keep all of them secret. 

Before we start, we need a Twitter developer account. This can be obtained from [https://developer.twitter.com/](https://developer.twitter.com/). 

After that, you will need to obtain your __consumer keys__ and __access tokens__ on the developer portal. There are many guides available online, for example
* [https://towardsdatascience.com/how-to-access-twitters-api-using-tweepy-5a13a206683b](https://towardsdatascience.com/how-to-access-twitters-api-using-tweepy-5a13a206683b)
* [https://realpython.com/twitter-bot-python-tweepy/#creating-twitter-api-authentication-credentials](https://realpython.com/twitter-bot-python-tweepy/#creating-twitter-api-authentication-credentials)

Once you have created an application, you will need to wait for Twitter to officially approve it. This can take from a few hours to a few days. 

To fetch tweets from Twitter, we use the `tweepy` package. The following code fetches recent tweets that match the keyword `'pokemon'` (last line of the snippet). 
```python
import tweepy

# Replace '***' with the keys from your developer portal
CONSUMER_KEY = '***'
CONSUMER_SECRET = '***'
ACCESS_TOKEN = '***'
ACCESS_TOKEN_SECRET = '***'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

# In Tweepy 4.0 and later this method is called api.search_tweets
tweets = api.search(q = "pokemon", count = 100, result_type = "recent")
```
As you can see, we use 2 pairs of keys. The consumer key and secret identify our application to the server, and the access token and secret then authorise the requests made on behalf of our account.  

`tweepy` provides many methods for obtaining relevant tweets. For example, the `search()` method returns tweets that match a keyword, and `search_users()` returns users that match a keyword. 

More of these methods, and how to use them, can be found at [https://tweepy.readthedocs.io/en/latest/api.html](https://tweepy.readthedocs.io/en/latest/api.html). 

__Exercise:__ What does each line of the code mean? If possible, copy the snippet and add a comment to each line. 

Today, we are going to fetch some tweets and write them in JSON format. JSON is a way to store structured data in a text file so that a program can read it back later. Python has a built-in package called `json`, so we will import it at the start. 
```python
import json
```
To convert the tweets into a form that can be written as JSON, we take the raw dictionary that `tweepy` stores on each tweet in its `_json` attribute 
```python
tweet_json = [tweet._json for tweet in tweets]
```
To store the data, we can then write 
```python
with open('tweets.json', 'w') as json_file:
    json.dump(tweet_json, json_file)
```
and the tweet data is now stored in `tweets.json` in your current folder location. 

To read back the JSON file, we can use the following snippet: 
```python
with open('tweets.json') as json_file:
    data = json.load(json_file)
```
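
The write-then-read round trip can be tried without a Twitter account. The sketch below uses made-up records in place of real tweet data and a temporary folder in place of your working folder; both are assumptions for illustration only.

```python
import json
import os
import tempfile

# Made-up records standing in for tweet data
records = [
    {'id': 1, 'text': 'I caught a pikachu', 'user': {'name': 'ash'}},
    {'id': 2, 'text': 'pokemon time', 'user': {'name': 'misty'}},
]

path = os.path.join(tempfile.mkdtemp(), 'sample.json')

# json.dump writes Python lists and dictionaries to an open file
with open(path, 'w') as json_file:
    json.dump(records, json_file)

# json.load reads them back as lists and dictionaries
with open(path) as json_file:
    loaded = json.load(json_file)

print(loaded == records)
print(loaded[0]['user']['name'])

# json.dumps (with an s) returns a string instead of writing to a file
as_text = json.dumps(records[0])
print(type(as_text))
```

Note the difference between `json.dump`, which needs a file to write to, and `json.dumps`, which returns a string.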

__Exercise:__ Write code that returns tweets with the keyword `'Apple'`. Store them in a file called `apple_tweets.txt`. 

In [None]:
# Solution
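import json
import tweepy

CONSUMER_KEY = '***'
CONSUMER_SECRET = '***'
ACCESS_TOKEN = '***'
ACCESS_TOKEN_SECRET = '***'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

tweets = api.search(q = "Apple", count = 100, result_type = "recent")

# Tweet objects cannot be written directly, so store their raw dictionaries
with open('apple_tweets.txt', 'w') as json_file:
    json.dump([tweet._json for tweet in tweets], json_file)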

In [None]:
# Your code below


As an extension, you can use a Twitter scraper to do more complex scraping. Given that one method call can fetch up to 1500 tweets, you can write code to find tweets within a date range and return more than 3200 at once. There are many of these scrapers online, and their authors often publish them in their GitHub repositories. 

### Web Scraping

The internet contains a wealth of resources, and we can use Python to extract information from webpages. To scrape from the internet, we follow these steps: 
1. Find the page that you want to scrape
2. Inspect the page
3. Write the code
4. Run the code and extract the data
5. Store the data in the required format 

In this exercise, we will try to scrape from a news website. Our example is [https://www.nbcnews.com/better/health/what-headline-stress-disorder-do-you-have-it-ncna830141](https://www.nbcnews.com/better/health/what-headline-stress-disorder-do-you-have-it-ncna830141), from which we wish to obtain the article itself. 

This means we should first look at the page itself, by clicking on the link and inspecting it. You should see something similar to the screenshot below. 

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url="fig/Capture01.png", width=620)

You will see that the webpage contains more than just the news article. In well-structured websites, there is a top bar for navigating to other pages of the news site. Often there are advertisements at the side of the article. If you scroll down, you will see the footer, which shows further information about the website. The article does not span the whole space in the middle of the page: there is also space for the author's information and the date of publication. 

__Exercise:__ Identify the page elements in the link. 

Our first step is to fetch the webpage onto our computer so that we can extract information from it. In this exercise, we will use a package called `requests`. The following snippet downloads the page, parses it and saves a copy of the HTML. 

```python
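import requests
from lxml import html

url = 'https://www.nbcnews.com/better/health/what-headline-stress-disorder-do-you-have-it-ncna830141'
htmlsource = requests.get(url).text
# Parse the HTML text into a tree that we can query later
tree = html.fromstring(htmlsource)

# The with block closes the file automatically, so no close() call is needed
with open('file/stress_news.html', mode = 'w', encoding = 'utf8') as fo:
    fo.write(htmlsource)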
```

To extract the contents, we need to know how webpages are written. Webpages are composed in a markup language called HTML, in which each piece of content is wrapped by tags. For example, to make text bold, we can write `<b>I am bold.</b>`, which shows __I am bold.__ The bold text starts at the `<b>` tag and ends at the `</b>` tag. A typical webpage looks like this: 
```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
   <head>
      <title>My first HTML document</title>
   </head>
   <body>
      <p>Hello world!</p>
   </body>
</html>
```
Let us look at the tags in more detail: 
* The first line (the `<!DOCTYPE>` declaration) tells the browser that this is an HTML document and which version of HTML it uses. 
* The `<head>` tag holds the meta information of the webpage. In this case, we define the title of the page inside the `<head>` tag. 
* Anything inside the `<body>` tag will appear on the screen (of your browser). In this case, the sentence 'Hello world!' is wrapped by the paragraph `<p>` tag. 
* Finally, every tag should be closed by an end tag that starts with a `/`. For example, a paragraph ends with a `</p>` tag. 
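
To see the start tags, end tags and text as a program sees them, the following sketch feeds a shortened copy of the page above to Python's built-in `html.parser` module. The class name `TagLogger` is made up for illustration, and this is only one of several ways to walk through HTML.

```python
from html.parser import HTMLParser

page = """<html>
   <head>
      <title>My first HTML document</title>
   </head>
   <body>
      <p>Hello world!</p>
   </body>
</html>"""


class TagLogger(HTMLParser):
    """Record every start tag, end tag and non-blank piece of text."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        # Skip the whitespace between tags
        if data.strip():
            self.events.append(('text', data.strip()))


parser = TagLogger()
parser.feed(page)
for kind, value in parser.events:
    print(kind, value)
```

Each start tag is matched by an end tag later in the list, which reflects the nesting described above.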

We cannot assume that our content spans the whole `<body>` tag, so we will inspect the page. To do so, press __F12__ on the opened page. You should see a panel appear beside your page showing the HTML markup. 

Then scroll through the HTML markup and hover over the tags until the article is highlighted on the page. The first tag you find will probably contain more than the article, so it is not yet the smallest tag that contains it. Click on the triangle beside the tag to expand it, and look again for the tag that highlights the article. Repeat this process until only the article contents are highlighted. 

__Exercise:__ What is the HTML tag that contains the whole article? You may want to include all the information inside the tag. 

It is a `<div>` tag and the whole tag reads 
```html
<div class="article-body__content">
```

Once you have found the tag that contains the whole article, you should see that it contains several `<p>` tags. These are the paragraphs, and you don't need to worry about them individually. 

To extract the article text, the HTML tag we found is important. If you do not have the answer, just use the solution from the exercise above. Given that we know the destination tag, how can we tell the computer where it is? We use a concept called __XPath__ (XML Path Language). It gives the address from the topmost level down to the tag we are after. 

To get the XPath of the article, right click on the tag we are after and choose to copy the 'XPath' or the 'Full XPath'. 

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url="fig/Capture02.png", width=480)

For example, the full XPath of the tag that stores the article is 
`/html/body/div[2]/div/div[5]/div/div/div/article/div/div[2]/div[1]/div[2]`, while the XPath is `//*[@id="content"]/div/div[5]/div/div/div/article/div/div[2]/div[1]/div[2]`. We can see that `html` appears at the start of the full XPath, followed by `body`. The path mirrors the nested structure of the HTML tags. 

Furthermore, 
* `/` separates the different levels. For example, `/html/body/` means the top layer is the `<html>` tag and the `<body>` tag is nested inside it.
* `[]` indicates which branch of the tree is chosen. For instance, if there are 2 `<div>` tags directly under `<body>`, `div[2]` would select the second (i.e. last) one.
* `//` means 'arbitrary depth'. For example, `/html/body//p` means any paragraph `<p>` under `<body>`, however deeply nested. Trilling (2019) advises starting your XPath with `//` to make it shorter and to avoid being too specific.
* `*` means 'any tag'. For instance, `/html/body/*` means every tag directly under `<body>`. 
* `//*` means every tag at any depth. This is why the copied XPath above starts with `//*[@id="content"]`: it looks for any tag whose `id` is `content`. 
* `[@attribute="whatever"]` lets you select only those tags that have a specific attribute value. Common attributes are `id` and `class`. Using the exercise above as an example, the `<div>` tag reads `<div class="article-body__content">`, so the XPath could be written as 
`//div[@class="article-body__content"]`. 
* `.` (a dot) means the current tag, so a path starting with `.` is relative to the tag we are currently at. 
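
The rules above can be tried on a small made-up page. To keep the sketch free of extra installations it uses the built-in `xml.etree.ElementTree` module instead of `lxml`. That module supports only a subset of XPath and requires paths to be relative to an element, which is why every path below starts with `.`. The page content and the class name `header` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

page = """<html>
  <body>
    <div class="header"><p>Menu</p></div>
    <div class="article-body__content">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(page)   # root is the <html> element

# '/' separates levels: <body> under the root, then its <div> children
divs = root.findall('./body/div')

# [2] picks the second <div> under <body> (XPath counts from 1)
second_div = root.find('./body/div[2]')

# '//' searches at any depth: every <p> anywhere below the root
all_paragraphs = root.findall('.//p')

# [@class='...'] filters on an attribute value
article = root.find(".//div[@class='article-body__content']")
article_text = [p.text for p in article.findall('./p')]

# '*' matches any tag: all direct children of <body>
children = root.findall('./body/*')

print(len(divs), second_div.get('class'))
print(len(all_paragraphs))
print(article_text)
```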

This means we can fetch (or _parse_) the article by specifying the XPath, which is done by the following code snippet. 
```python
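from lxml import html

# tree is the parsed page created with html.fromstring() earlier
article_content = tree.xpath('//*[@id="content"]/div/div[5]/div/div/div/article/div/div[2]/div[1]/div[2]')

# text_content() collects the text of the tag and of everything nested inside it
article_content_text = [r.text_content().strip() for r in article_content]
print(article_content_text)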
```

__Exercise:__ Use the snippets above to print out the texts of the article. If possible, write the texts into a `.txt` file. 

__Exercise:__ Use the snippets above to extract the positive words in the Harvard IV-4 dictionary. After that, write the list of words into a `.txt` file. 

This is the link: `http://www.wjh.harvard.edu/~inquirer/Pstv.html`

### EDA

_Some of the contents come from https://github.com/info-370/eda._

Exploratory data analysis (__EDA__) is the process of understanding the datasets we have sourced. It starts by summarising the data structure and then visualising selected features to quickly understand univariate distributions and relationships between variables. Initial EDA asks basic questions, including:

* How large is the dataset (rows, columns)?
* What are the variables present in the dataset?
* What is the data type of each variable?

This week, we will start some work on EDA with a marks dataset and a text dataset. You can source your own dataset as well. 

When using Jupyter to analyse data, we should keep the habit of grouping each task into its own cell. For example, a conventional layout includes (in order): 
1. Packages/ libraries needed to be imported
2. Custom functions you wish to use
3. Import data
4. EDA
5. Analysis
6. Report on your findings

This keeps the notebook organised, and makes it intuitive for later users to run the early cells first so that there are no 'not found' errors. In the following, we have a dedicated cell for importing the packages. When we need to import another library, we add it to the next cell and rerun it. 

In [None]:
# Import all needed packages in here
import numpy as np
import pandas as pd




In this exercise, we will use a sample marks data set called `marks.csv`. It is stored in the `/files` folder. 

__Exercise:__ Read the `marks.csv` file and import the data as a `pandas` dataframe. 

In [None]:
# Your code below
df = pd.read_csv('???')

Sometimes the input `.csv` file is encoded differently from what Python expects. For example, if the `.csv` file contains Chinese or other non-English text, it may use an encoding that is unknown to us. In that case we need to add the `encoding` option to `.read_csv()` to tell it how to decode the file. For example, to read the file as UTF-8 (a Unicode encoding): 

```python
df = pd.read_csv('???', encoding='utf8')
```
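
To see what an encoding mismatch looks like, the following sketch writes a small made-up file in Latin-1 and then tries to read it back as UTF-8. It uses Python's built-in `open()`, whose `encoding` argument plays the same role as the one in `.read_csv()`. The file name and its contents are assumptions for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'names.csv')

# Write a small file in Latin-1 rather than UTF-8
with open(path, 'w', encoding='latin-1') as f:
    f.write('name,mark\nZoë,18\nJosé,15\n')

# Reading it back as UTF-8 fails: the bytes for ë and é are not valid UTF-8
try:
    with open(path, encoding='utf8') as f:
        f.read()
    utf8_ok = True
except UnicodeDecodeError as error:
    utf8_ok = False
    print('Could not decode as UTF-8:', error.reason)

# Reading with the encoding the file was written in works
with open(path, encoding='latin-1') as f:
    text = f.read()
print(text)
```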

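If you see an encoding error, it helps to know which encoding Python tried. The following code prints the file object, which shows the default encoding that Python uses to open files on your system. Note that this does not detect the file's actual encoding. If the default is not UTF-8, try the `encoding='utf8'` option shown earlier. 

In [None]:
# This prints the file object, including the default encoding used to open it. 
with open('top50.csv') as f:
    print(f)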

In this exercise, we don't need to worry about the encoding. 

So let us start exploring the dataset. This means we need to print out the features inside the dataset. In the following, we use the `pandas` methods from last week, which should be familiar to you. 

__Exercise:__ How can we see the first __10 records__ of the dataset?

In [None]:
# Solution
print(df.head(10))

In [None]:
# Your code below


__Exercise:__ How can we know what columns are in the data set?

In [None]:
# Solution
print(df.columns)

In [None]:
# Your code below


__Exercise:__ How can we find out how many rows and columns we have?

In [None]:
# Solution
print(df.shape)

In [None]:
# Your code below

__Exercise:__ We now know how to find the number of rows and columns in the dataset. Print the result as `'There are {} records and {} columns in this dataset.'` where the brackets are substituted by the number of rows and columns in the dataset. 

__Hint:__ Use the code above and think about how the returned value is able to hold 2 values. What data type did we obtain? How can we extract the individual values? 

In [None]:
# Solution
print('There are {} records and {} columns in this dataset.'.format(df.shape[0],df.shape[1]))

In [None]:
# Your code below


__Exercise:__ How can we find out the data type of each column? Are they the same for every column?

In [None]:
# Solution
print(df.dtypes)

In [None]:
# Your code below


At the end of the EDA, we get some statistical information from the dataset. To do so, we rely on the built-in `.describe()` method from `pandas`. Run the code below to see how the marks are distributed. 

In [None]:
# Run this code
df.describe()

Let us go through the results row by row. 
* The first row tells us how many (non-null) records we have in each column. Given that all counts are the same, and provided they equal the number of rows we found with `.shape` previously, there are no missing (null) values. 
* The next 2 rows show the mean and standard deviation of each question. These are most meaningful when interpreted against the total marks of each question. 
* The next rows show the minimum, the quartiles and the maximum. For example, how does the lowest 25% perform? This gives us a perception of how the cohort performs. Are they doing well in this question, or not?

However, if we look at the last row, something is wrong. If you look carefully, you should see that the maximum marks are 0.5 more than they are supposed to be. We need to find out how many people have this mark and then rectify the situation. 

To find out how many people have this problem, we use the following code: 

In [None]:
print('=== Excessive marks awarded ===')
print('Q1:',df.loc[df.Q1 == 20.5, 'Q1'].count(), 'students')
print('Q2:',df.loc[df.Q2 == 20.5, 'Q2'].count(), 'students')
print('Q3:',df.loc[df.Q3 == 15.5, 'Q3'].count(), 'students')
print('Q4:',df.loc[df.Q4 == 20.5, 'Q4'].count(), 'students')
print('Q5:',df.loc[df.Q5 == 15.5, 'Q5'].count(), 'students')
print('Q6:',df.loc[df.Q6 == 10.5, 'Q6'].count(), 'students')

How would you diagnose the problem? Think about a possible cause of this incorrect data entry and write code to check it. 

The reason this happens is that the dataset was generated by code that was meant to produce marks in steps of `0.5`. However, that code can also produce a value 0.5 above the full marks, which causes the problem above. 
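
The code that generated the dataset is not shown here, so the following is only a guess at the kind of slip that produces this pattern. It assumes half marks were created by drawing a random integer and dividing by 2, and that the upper bound was one too large. The names `FULL_MARKS` and `all_possible_marks` are made up for illustration.

```python
from random import randint

FULL_MARKS = 20   # full marks for a hypothetical question


def all_possible_marks(upper):
    # Every value randint(0, upper) / 2 can take; randint includes both ends
    return [k / 2 for k in range(0, upper + 1)]


# Intended: half-mark steps from 0 to 20, so the integer runs from 0 to 40
correct = all_possible_marks(2 * FULL_MARKS)
# Off by one: an upper bound of 41 allows 41 / 2 = 20.5
off_by_one = all_possible_marks(2 * FULL_MARKS + 1)

print(max(correct), max(off_by_one))

# A single mark drawn with the correct bound never exceeds the full marks
sample_mark = randint(0, 2 * FULL_MARKS) / 2
print(sample_mark)
```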

In [None]:
# Your code below


We know that not many people have this problem, so it is fine to fix it directly. For the purpose of this exercise, let us convert these marks into full marks. This is the code to do that. 

In [None]:
df.loc[df.Q1 == 20.5, 'Q1'] = 20
df.loc[df.Q2 == 20.5, 'Q2'] = 20
df.loc[df.Q3 == 15.5, 'Q3'] = 15
df.loc[df.Q4 == 20.5, 'Q4'] = 20
df.loc[df.Q5 == 15.5, 'Q5'] = 15
df.loc[df.Q6 == 10.5, 'Q6'] = 10

Can you explain what the code means? You might need to look at the `.loc` documentation in `pandas`. 
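
As a starting point, the sketch below applies the same pattern to a small made-up dataframe (the name `toy` and its values are assumptions for illustration). The expression before the comma is a Boolean mask that selects rows, and the name after the comma selects the column.

```python
import pandas as pd

# Made-up marks; 20.5 is above the full mark of 20, and 10.5 above 10
toy = pd.DataFrame({'Q1': [12.0, 20.5, 18.5, 20.5],
                    'Q6': [10.5, 7.0, 9.5, 10.0]})

mask = toy.Q1 == 20.5          # Boolean Series: True where the mark is 20.5
n_affected = int(mask.sum())   # True counts as 1, so the sum counts the rows

# Rows where mask is True, column 'Q1': overwrite with the full mark
toy.loc[mask, 'Q1'] = 20
toy.loc[toy.Q6 == 10.5, 'Q6'] = 10

print(n_affected)
print(toy)
```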

### EDA with Text Files

In these 2 weeks we will also look at how to draw inferences from text files. The key to analysing text datasets is to encode them quantitatively. 
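
One simple way to encode text quantitatively is to count how often each word appears. The sketch below does this for two made-up sentences with the built-in `collections.Counter`. The helper name `word_counts` and the choice of punctuation to strip are assumptions for illustration, and real text usually needs more careful cleaning.

```python
from collections import Counter

# Made-up sentences standing in for a text dataset
documents = [
    'The news made me anxious',
    'Good news: the weather is good today',
]


def word_counts(text):
    # Lower-case each word, strip basic punctuation, then count
    words = [w.strip('.,:;!?').lower() for w in text.split()]
    return Counter(w for w in words if w)


counts = [word_counts(doc) for doc in documents]
total = sum(counts, Counter())   # add the per-document counts together

print(counts[1]['good'])
print(total.most_common(3))
```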

In [None]:
df.hist()

In [None]:
print(df['Track.Name'].unique())

In [None]:
print(df['Artist.Name'].unique())
print('\n=== Tally ===')
print(df['Artist.Name'].value_counts())
print('\n')
print('There are {} artists on Spotify top 50.'.format(df['Artist.Name'].unique().shape[0]))

In [None]:
print(df['Artist.Name'].unique())

```python
from matplotlib import pyplot as plt
%matplotlib inline
```

In [None]:
plt.rcParams["figure.figsize"] = (20,10)
artist_counts = df['Artist.Name'].value_counts()
plt.bar(artist_counts.index, artist_counts.values)
plt.xticks(rotation=90)
plt.show()

Now we have explored most of the dataset. Next, we will look for correlations between the different variables. 
```python
import seaborn as sns
sns.set(style="ticks")
```

In [None]:
sns.pairplot(df)

### Data Cleaning

Data understanding also involves data cleaning, which means finding outliers and other problems, for example 
* Which data points have values that are too big?
* Which data points are illogical?
* How much data is missing?
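
These three questions can each be answered with a line of `pandas`. The sketch below uses a small made-up dataframe, and the thresholds (full marks of 20, no negative marks) are assumptions for illustration.

```python
import pandas as pd

# Made-up marks out of 20; None marks a missing entry
toy = pd.DataFrame({'student': ['A', 'B', 'C', 'D'],
                    'Q1': [15.0, 35.0, None, -2.0]})

FULL_MARKS = 20

too_big = toy[toy.Q1 > FULL_MARKS]      # values larger than possible
illogical = toy[toy.Q1 < 0]             # negative marks make no sense
missing_per_column = toy.isna().sum()   # missing values in each column

print(too_big)
print(illogical)
print(missing_per_column)
```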

Often this is done as soon as we have a dataset. For the purpose of this week, it is done in the next step. In practice, data cleaning has to be done whenever we obtain a new dataset. In the following, let us use some templates to clean the dataset. 

## Data Preparation

In data science, understanding the data initiates the project. It is important to foresee how the data will contribute to the deliverables. The next process, __data preparation__, involves the following: 
* Deriving new columns
* Merging data sources
* Data cleaning

This means in this process, we would look for new data. Often

## Modelling

After all the preparation work is done, it is time to plan the deliverables. In data analytics, we make __models__. These are mathematical models. You can think of them as a line plot on the computer screen. A model is how we capture a concept in a concise way. For example, 
* We describe how the population of birds in a nature reserve grows with a line plot over time.
* After we find out why the sewage system is blocked across the whole city, we could draw a map of the sewage system to show the situation. 

So once we have our datasets on hand, we can start modelling. There are many ways to model from data: 
* Statistical modelling (See __next week__) 
* Machine learning (See __week 9 and 10__) 
* Dynamical equation modelling
* Agent based modelling

Statistical modelling usually means finding out how one feature of the target we measure is associated with another feature, so we can draw a line diagram from it. For machine learning, the approach depends on your problem, and you can read more from [Google](https://developers.google.com/machine-learning/problem-framing/cases). Often, you don't need machine learning to model your dataset. There is a worksheet from Google to help you decide whether you need machine learning or not. 

[https://developers.google.com/machine-learning/problem-framing/framing](https://developers.google.com/machine-learning/problem-framing/framing)
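
To make the statistical modelling idea concrete, the sketch below fits a straight line through a handful of made-up points using the ordinary least-squares formulas. The data, the variable names and the helper `fit_line` are assumptions for illustration, and next week's material covers the proper tools.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: hours studied against marks
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, marks)
print('marks = {:.2f} * hours + {:.2f}'.format(slope, intercept))
```

The slope describes how much the marks change, on average, for each extra hour, which is the association between the two features.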

At the end of your modelling process, you will need to evaluate your model. The technical side of this will be covered next week. Often you will need to assess your model with __model diagnostics__. You will then be able to ask yourself the following questions: 
* Does my model contribute to what I want?
* Is there any way to optimise my computation?
* Did any surprises (good or bad) occur when modelling?
* Were there any mistakes, and what lessons were learnt?

Think about these questions once we have done next week's work. 

## Deployment

At the end of your data analytics cycle, you will write a summary of your model and then use it in real life. There are many examples: 
* Recommendation systems on YouTube
* Weather prediction 
* Spread of bushfires

While each dataset is unique, there are always possible applications for it. 

__Exercise:__ Think about what applications could come from the data analysis we have just made. 

If you can't think of any applications off the top of your head, that is fine. There are many ways to turn an analysis into a useful application. We call these __deliverables__. In practice, people will: 
1. Summarise the results from their models
2. Create a step-by-step plan for deployment
3. Follow up on questions such as: 
    * Are there any factors or influences that need following up? 
    * How valid and accurate is each model?
    * Has the model become unrealistic?

After that you will need to write a final report to summarise your cycle. 

Before we finish, do you remember what the arrows in the CRISP-DM flowchart look like? In fact, a business process is never uni-directional. Always remember to review your work. It will help you detect mistakes as early as possible. 

## Conclusion

This week, we have looked into the process of data analytics, including 
* How to define a solid purpose for a data science project
* How to complete EDA
* How to understand the data analytics process

## Further Reading

* Trilling, D. (2019). _Doing Computational Social Science with Python: An Introduction_. [Online; Accessed on 16th December 2019]