# Week 7 - Data Science

## Drill

__Exercise:__ Write a function `hasvowel()` which accepts a word as input and returns `True` if the word contains a vowel, and `False` otherwise. Think about what the argument must satisfy and raise an error if it does not. The string method `str.isalpha()` might be helpful. 

In [None]:
# Solution
'''
If the argument is not a word (i.e. it is not a string, or it contains
spaces or non-letter characters), then raise ValueError.
'''
def hasvowel(word):
    if not isinstance(word, str):
        raise ValueError('word must be a string')
    # Check for spaces and non-letter characters
    if not word.isalpha():
        raise ValueError('word must contain letters only')

    # ('a' or 'e' or ...) in word would only test for 'a',
    # so check each vowel separately (upper or lower case)
    return any(vowel in word.lower() for vowel in 'aeiou')

In [None]:
# Your code below


__Exercise:__ Write code that finds the sum of the squares of the first 50 natural numbers. 

In [2]:
# Solution
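total = 0   # avoid the name `sum`, which would shadow the built-in function
for i in range(1, 51):   # the natural numbers 1, 2, ..., 50
    total += i ** 2
print(total)

42925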


In [None]:
# Your code below


__Exercise:__ Create a pandas dataframe with the following columns. 
* `id`: Integers of 8 digits, starting from `10000000`.
* `name`: Strings.

After this, run your code and show the first 5 records of the dataframe you have created. 

In [7]:
# Solution
import pandas as pd
df = pd.DataFrame({
    'id': [10000000, 10000001, 10000002, 10000003, 10000004],
    'name': ['Shelly', 'Kent', 'Martin', 'John', 'Myra'],
})
print(df.head(5))

         id    name
0  10000000  Shelly
1  10000001    Kent
2  10000002  Martin
3  10000003    John
4  10000004    Myra


In [None]:
# Your code below


## Data Science Process

In data analytics, there is a standard procedure that data scientists follow to carry out a data science project. It is called __CRISP-DM__ (Cross-Industry Standard Process for Data Mining). It runs from generating ideas, through finding the ingredients for a data science product, to reporting the results. The following flowchart shows the process. 

In [9]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url="fig/CRISPDM_Process_Diagram.png", width=620)

# Image from https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome

## Setting Your Goal

In the first phase, the organisation clarifies the project objectives and the business requirements, and often summarises them in documents. This is the __business understanding__ phase. It is not restricted to businesses: any organisation, or __you__, can have a goal, and this goal is what guides you down the road of data analytics. 

__Exercise:__ Think about your career: what is your objective? How could data help you achieve this objective? 

## Data Understanding

To start a data science project, we need relevant data sources. Finding them is part of the __data understanding__ phase, which involves the following steps: 
* Sourcing data
* Exploratory data analysis (EDA)

### Sourcing Data from Twitter

Twitter is a popular social media site where users express themselves within a limited text length. It attracts a lot of attention from researchers who want to understand how opinions form on Twitter. To make it easier to study what is inside the social network, Twitter offers an API that allows us to obtain information from its servers. 

The API is the gateway to Twitter's servers, so we need credentials to get access. Twitter issues these as two pairs of secret strings: the __consumer key__ and consumer secret identify your application, while the __access token__ and access token secret identify the Twitter account on whose behalf the application acts. Treat all of them like passwords and never publish them. 

Before we start, we need a Twitter developer account. It can be obtained from [https://developer.twitter.com/](https://developer.twitter.com/). 

After that, you will need to obtain your __consumer keys__ and __access tokens__ on the developer portal. There are many guides available online, for example
* [https://towardsdatascience.com/how-to-access-twitters-api-using-tweepy-5a13a206683b](https://towardsdatascience.com/how-to-access-twitters-api-using-tweepy-5a13a206683b)
* [https://realpython.com/twitter-bot-python-tweepy/#creating-twitter-api-authentication-credentials](https://realpython.com/twitter-bot-python-tweepy/#creating-twitter-api-authentication-credentials)

Once you have created an application, Twitter needs to approve it officially. This can take from a few hours to a few days. 

To fetch tweets from Twitter, we use the `tweepy` package. The following code fetches recent tweets that match the keyword `'pokemon'` (last line of the snippet). 
```python
import tweepy

# Credentials from the developer portal (replace '***' with your own)
CONSUMER_KEY = '***'
CONSUMER_SECRET = '***'
ACCESS_TOKEN = '***'
ACCESS_TOKEN_SECRET = '***'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

tweets = api.search(q = "pokemon", count = 100, result_type = "recent")
```
As you can see, we use 2 pairs of keys. The consumer key and secret passed to `OAuthHandler` identify our application to the server, and the access token and secret passed to `set_access_token()` identify the account that the application acts for. 

`tweepy` provides many methods for obtaining relevant tweets. For example, the `search()` method returns tweets that match a keyword (from tweepy 4.0 onwards it is named `search_tweets()`), and `search_users()` returns users that match a keyword. 

More of these methods, and how to use them, can be found at [https://tweepy.readthedocs.io/en/latest/api.html](https://tweepy.readthedocs.io/en/latest/api.html). 

__Exercise:__ What does each line of the code do? If possible, copy the snippet and add a comment to each line. 

Today, we are going to fetch some tweets and write them in JSON format. JSON is a way to store structured data in text files so that a program can read it back later. Python has a built-in package called `json`, so we import it at the start. 
```python
import json
```
Each tweet returned by `tweepy` keeps its raw JSON data in the `_json` attribute, so we collect these into a list 
```python
tweet_json = [tweet._json for tweet in tweets]
```
Next we need to store the data, so we write 
```python
with open('tweets.json', 'w') as json_file:
    json.dump(tweet_json, json_file)
```
and the tweet data is now stored in `tweets.json` in your current folder location. 

To read the JSON file back, we can use the following snippet: 
```python
with open('tweets.json') as json_file:
    data = json.load(json_file)
```
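
The write-then-read round trip can be tried without any Twitter credentials. The sketch below assumes a small hand-made list of dictionaries (`fake_tweets`, a hypothetical stand-in for real tweet data) and a temporary folder, and checks that what is read back equals what was written.

```python
import json
import os
import tempfile

# Hypothetical stand-in for tweet data: a list of plain dictionaries.
fake_tweets = [
    {"id": 1, "text": "I love pokemon", "user": {"screen_name": "ash"}},
    {"id": 2, "text": "pokemon cards for sale", "user": {"screen_name": "misty"}},
]

# Work inside a temporary folder so that nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "tweets.json")

    # Write the data as JSON text.
    with open(path, "w") as json_file:
        json.dump(fake_tweets, json_file)

    # Read the JSON text back into Python objects.
    with open(path) as json_file:
        data = json.load(json_file)

# The nested structure survives the round trip.
print(data == fake_tweets)
print(data[0]["user"]["screen_name"])
```

After loading, `data` is an ordinary list of dictionaries, so individual fields can be reached with the usual indexing.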

__Exercise:__ Write code that fetches tweets with the keyword `'Apple'` and stores them in a file called `apple_tweets.txt`. 

In [None]:
# Solution
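import json
import tweepy

CONSUMER_KEY = '***'
CONSUMER_SECRET = '***'
ACCESS_TOKEN = '***'
ACCESS_TOKEN_SECRET = '***'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

tweets = api.search(q = "Apple", count = 100, result_type = "recent")

# The result objects are not JSON serialisable, so keep their raw `_json` data
with open('apple_tweets.txt', 'w') as json_file:
    json.dump([tweet._json for tweet in tweets], json_file)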

In [None]:
# Your code below


As an extension, you can use a Twitter scraper to do more complex scraping. Given that one method call can fetch up to 1500 tweets, you can write code to find tweets within a date range and return more than 3200 at once. Many such scrapers are available online, and their authors often publish them in their GitHub repositories. 

### Web Scraping

The internet contains a wealth of resources, and we can use Python to extract information from webpages. To scrape from the internet, we follow these steps: 
1. Find the page that you want to scrape
2. Inspect the page
3. Write the code
4. Run the code and extract the data
5. Store the data in the required format 

In this exercise, we will try to scrape a news website. Our example is [https://www.nbcnews.com/better/health/what-headline-stress-disorder-do-you-have-it-ncna830141](https://www.nbcnews.com/better/health/what-headline-stress-disorder-do-you-have-it-ncna830141), from which we wish to obtain the article itself. 

This means we should look at the page itself by clicking on the link and inspecting it. You should see something similar to the screenshot below. 

In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url="fig/Capture01.png", width=620)

You will see that the webpage contains more than just the news article. In well-structured websites, there is a top bar for navigating to other pages of the news site. Often there are advertisements at the side of the article. If you scroll down, you will see the footer, which shows additional information about the website. The article does not span the whole space in the middle of the page: there is also room for the author's information and the date of publication. 

__Exercise:__ Identify the page elements in the link. 

Our first step is to fetch the webpage onto our computer so that we can extract information from it. In this exercise, we will use a package called `requests`. The following snippet downloads the page and saves it to a file. 

```python
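import requests
from lxml import html

url = 'https://www.nbcnews.com/better/health/what-headline-stress-disorder-do-you-have-it-ncna830141'
htmlsource = requests.get(url).text
tree = html.fromstring(htmlsource)   # parsed page, used for XPath queries later

# The `file` folder must already exist; the `with` block closes the file for us
with open('file/stress_news.html', mode = 'w', encoding = 'utf-8') as fo:
    fo.write(htmlsource)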
```

To extract the contents, we need to know how webpages are written. Webpages are composed in a markup language called HTML, in which each piece of content is wrapped by a tag. For example, to make text bold, we can write `<b>I am bold.</b>`, which shows __I am bold.__ The bold text starts at the `<b>` tag and ends at the `</b>` tag. A typical webpage looks like 
```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
   <head>
      <title>My first HTML document</title>
   </head>
   <body>
      <p>Hello world!</p>
   </body>
</html>
```
Let us look at the tags in more detail: 
* The first line tells the computer that this is a webpage. 
* The `<head>` tag holds the meta information of the webpage. In this case, we define the title of the page inside the `<head>` tag. 
* Anything inside the `<body>` tag will appear on the screen (of your browser). In this case, the sentence 'Hello world!' is written and wrapped by the paragraph tag `<p>`. 
* Finally, most tags are closed by an end tag that starts with a `/`. For example, a paragraph ends with a `</p>` tag. 
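
To see the nesting of start and end tags explicitly, the example page can be fed to Python's built-in `html.parser` module. This is only an illustrative sketch: the class name `TagPrinter` and its `lines` and `depth` attributes are made up for this example, and the scraping later in this section uses a different package.

```python
from html.parser import HTMLParser

page = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
   <head>
      <title>My first HTML document</title>
   </head>
   <body>
      <p>Hello world!</p>
   </body>
</html>"""


class TagPrinter(HTMLParser):
    """Record each start tag, end tag and piece of text, indented by depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + "<" + tag + ">")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1
        self.lines.append("  " * self.depth + "</" + tag + ">")

    def handle_data(self, data):
        if data.strip():  # skip the whitespace between tags
            self.lines.append("  " * self.depth + data.strip())


parser = TagPrinter()
parser.feed(page)
print("\n".join(parser.lines))
```

The indentation in the printed result mirrors how `<title>` sits inside `<head>` and `<p>` sits inside `<body>`.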

We cannot assume that our content spans the whole `<body>` tag, so we will inspect it. To do so, press __F12__ on the opened page. You should see a panel appear beside your page showing the HTML markup. 

Then move the mouse over the HTML markup until the article is highlighted on the page. The highlighted tag may contain more than just the article, and we want the smallest tag that contains the whole article. So click on the triangle beside the tag to expand it, and look for the nested tag that highlights the article. Repeat this process until only the article content is highlighted. 

__Exercise:__ What is the HTML tag that contains the whole article? Include any attributes written inside the tag. 

It is a `<div>` tag and the whole tag reads 
```html
<div class="article-body__content">
```

Once you have found the tag that contains the article itself, you will see that it contains several `<p>` tags. These are the paragraphs of the article, and you do not need to worry about them individually. 

To extract the article text, the HTML tag we found is important. If you do not have the answer, just use the solution of the exercise above. Given that we have the destination tag, how can we tell the computer where it is? We use a concept called __XPath__ (XML Path Language). It gives the address of the tag we are after, starting from the topmost level. 

To get the XPath of the article, right-click on the tag we are after in the inspector and choose 'Copy XPath' or 'Copy full XPath'. 

In [2]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url="fig/Capture02.png", width=480)

For example, the full XPath of the tag that stores the article itself is 
`/html/body/div[2]/div/div[5]/div/div/div/article/div/div[2]/div[1]/div[2]`, while the XPath is `//*[@id="content"]/div/div[5]/div/div/div/article/div/div[2]/div[1]/div[2]`. We can see that `html` appears at the start of the full XPath, followed by `body`. The path follows the nested structure of the HTML tags. 

Furthermore, 
* `/` separates the different levels. For example, `/html/body/` means the top layer is the `<html>` tag and the `<body>` tag is nested inside it.
* `[]` indicates which branch of the tree is chosen. For instance, if there are 2 `<div>` tags directly under `<body>`, `div[2]` selects the second (i.e. last) one.
* `//` means 'arbitrary depth'. For example, `/html/body//p` means any paragraph `<p>` under `<body>`, however deeply nested. Trilling (2019) advises starting your XPath with `//` to make it shorter and avoid being too specific.
* `*` means 'any tag'. For instance, `/html/body/*` means every tag directly under `<body>`. 
* `//*` means any tag at any depth. For example, `//*[@id="content"]` selects the tag whose `id` is `content`, wherever it sits in the page. 
* `[@attribute="whatever"]` lets you select only those tags that have a specific attribute value. Common attributes are `id` and `class`. Using the exercise above as an example, the `<div>` tag reads `<div class="article-body__content">`, so the XPath can be written as 
`//div[@class="article-body__content"]`. 
* `.` means the current tag, so a path starting with `.` is relative to a tag we have already selected. 
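
These path rules can be tried on a tiny hand-written page before touching the real article. The sketch below is an assumption-laden illustration: the page is made up, and it uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath and only accepts paths relative to an element, so every path here starts with `.`.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written page (hypothetical, not the NBC article).
page = """
<html>
  <body>
    <div class="navigation"><p>Home</p></div>
    <div class="article-body__content">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page)  # root is the <html> element

# './body/div[2]': from <html>, go to <body>, then take the second <div>.
second_div = root.find("./body/div[2]")
print(second_div.get("class"))

# './/p': every <p> at any depth below the current element.
all_paragraphs = [p.text for p in root.findall(".//p")]
print(all_paragraphs)

# [@class="..."]: only the <div> whose class attribute matches.
article = root.find('.//div[@class="article-body__content"]')

# './p' is relative to the <div> we have just selected.
article_paragraphs = [p.text for p in article.findall("./p")]
print(article_paragraphs)

# '*' matches any tag: here, every direct child of <body>.
children_of_body = [child.tag for child in root.findall("./body/*")]
print(children_of_body)
```

Selecting by `class` keeps working if another `<div>` is added before the article, whereas the positional `div[2]` would then point at the wrong tag.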

```python
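from lxml import html

# `tree` is the parsed page from the `requests` snippet above
article_content = tree.xpath('//*[@id="content"]/div/div[5]/div/div/div/article/div/div[2]/div[1]/div[2]')

# text_content() gathers the text of the tag and everything nested in it
article_content_text = [r.text_content().strip() for r in article_content]
print(article_content_text)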
```

### EDA

_Some of the content comes from https://github.com/info-370/eda._

Exploratory data analysis (__EDA__) is the process of understanding the datasets we have sourced. It starts by summarising the data structure and then visualising selected features to quickly understand univariate distributions and relationships between variables. Initial EDA asks basic questions, including:

* How large is the dataset (rows, columns)?
* What are the variables present in the dataset?
* What is the data type of each variable?
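
Each of these questions maps onto a short pandas call. The following is a minimal sketch on a made-up marks table (the column names and values are assumptions for illustration, not the dataset used in class).

```python
import pandas as pd

# Hypothetical marks data, made up for illustration.
marks = pd.DataFrame({
    "id": [10000000, 10000001, 10000002, 10000003],
    "name": ["Shelly", "Kent", "Martin", "John"],
    "mark": [72.5, 58.0, 91.0, 64.5],
})

# How large is the dataset (rows, columns)?
print(marks.shape)

# What are the variables present in the dataset?
print(list(marks.columns))

# What is the data type of each variable?
print(marks.dtypes)

# A quick numerical summary of the numeric columns.
print(marks.describe())
```

The same calls work unchanged on a dataframe loaded from a file, so they are a reasonable first cell in any EDA notebook.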

This week, we will start some work on EDA with a marks dataset and a text dataset. You can source your own dataset as well. 

### Data Cleaning

EDA also involves data cleaning, which means finding outliers and other problems, for example 
* Which data points have values that are too big?
* Which data points are illogical?
* How much data is missing?
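
The three questions above can be sketched in pandas as follows. The data are made up, and the rule that a valid mark lies between 0 and 100 is an assumption for this example; real datasets need their own rules.

```python
import pandas as pd

# Hypothetical marks out of 100, made up for illustration.
marks = pd.DataFrame({
    "name": ["Shelly", "Kent", "Martin", "John", "Myra"],
    "mark": [72.5, 580.0, -4.0, None, 64.5],
})

# Which data points are too big? (assumed maximum mark: 100)
too_big = marks[marks["mark"] > 100]

# Which data points are illogical? (a mark cannot be negative)
illogical = marks[marks["mark"] < 0]

# How many values are missing in each column?
missing_counts = marks.isna().sum()

print(too_big)
print(illogical)
print(missing_counts)

# One possible clean-up: keep only rows with a mark between 0 and 100.
clean = marks[marks["mark"].between(0, 100)]
print(clean)
```

Dropping rows is only one option. Depending on the project, a suspicious value might instead be corrected at the source or flagged for review.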

Often this is done as soon as we have a dataset. For the purpose of this week, it is done in the next step. In practice, data cleaning has to be done whenever we obtain a new dataset. In the following, let us use some templates to clean the dataset. 

## Data Preparation

In data science, understanding the data initiates the project. It is important to foresee how the data will contribute to the deliverables. The next phase, __data preparation__, involves the following processes: 
* Deriving new columns
* Merging data sources
* Data cleaning
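
As a rough illustration of the first two processes, the sketch below merges two made-up tables on a shared `id` column and then derives a new column. The table contents and the 60/40 weighting are assumptions chosen only for this example.

```python
import pandas as pd

# Two hypothetical data sources that share the `id` column.
students = pd.DataFrame({
    "id": [10000000, 10000001, 10000002],
    "name": ["Shelly", "Kent", "Martin"],
})
marks = pd.DataFrame({
    "id": [10000000, 10000001, 10000002],
    "exam": [70, 55, 90],
    "coursework": [80, 65, 85],
})

# Merging data sources: match rows by `id`, keeping every student.
df = students.merge(marks, on="id", how="left")

# Deriving a new column: a final mark with an assumed 60/40 weighting.
df["final"] = 0.6 * df["exam"] + 0.4 * df["coursework"]

print(df)
```

With `how="left"`, a student without a matching row in `marks` would still appear, with missing values in the mark columns, which ties back to the data cleaning step.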

This means in this process, we would look for new data. Often

## Plan

After all the preparation work is done, it is time to plan how to make the deliverables. 

## Conclusion

This week, we have looked into the process of data analytics and 
* How to define a solid purpose for a data science project
* How to complete EDA

## Further Reading

* Trilling, D. (2019). _Doing Computational Social Science with Python: An Introduction_. [Online; Accessed on 16th December 2019]