
# **Today's Plan**


- Pandas Revision
- Mini Visualization
- Scraping the Web
- Twitter API
- Mini NLP

# **Pandas**


![pandas1](https://media.giphy.com/media/Wa5JDuv6kzoTC/giphy.gif)


## Why is Pandas AMAZING?

1. super easy data loading
2. easy data cleaning
3. easy data manipulation
4. easy merging and extraction



A pandas Series is a one-dimensional array. It holds any data type supported in Python and uses labels to locate each data value for retrieval.

In [None]:
import pandas as pd 

ser = pd.Series(["My", "name", "is", "ash", "hello"])
ser

In [None]:
ser = pd.Series(["My", "name", "is", "ash", "hello"], index = [1,2,3,4,5])
ser
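
To make the role of the labels concrete, here is a minimal sketch that reuses the Series above with its custom index 1 to 5. `.loc` looks a value up by label and `.iloc` by position; the names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

# Same Series as above, with labels 1..5 instead of the default 0..4
ser = pd.Series(["My", "name", "is", "ash", "hello"], index=[1, 2, 3, 4, 5])

by_label = ser.loc[1]      # label 1 is the first element
by_position = ser.iloc[1]  # position 1 is the second element

print(by_label)     # My
print(by_position)  # name
```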

We will create a shopping list again, this time with the data as a dict of lists:


In [None]:
my_data = {
    'drygroceries': ["pasta", "rice", "coffee", "beans"],
    'milkproducts': ["cheese", "cream", "sour cream", "yogurt"],
    'drinks': ['tea', 'coffee', 'water', 'juice']
}

In [None]:

# Index is also optional
my_index = ['a', 'b', 'c', 'd']

In [None]:
df = pd.DataFrame(data=my_data,index=my_index)
df

### Let's learn to load data in pandas

We are going to start with yesterday's data: German Sputnik newspaper articles.

**Let's try with pandas**

In [None]:
sputnikdata1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Oxford Text Analysis 22/Python QTA Drive Oxford 2022/Data/sputnikgerman20.tsv')

# Why do we get an error message? Hint: read_csv assumes commas by default, but this file is tab-separated.

In [None]:
import pandas as pd

sputnikdata1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Oxford Text Analysis 22/Python QTA Drive Oxford 2022/Data/sputnikgerman20.tsv',sep='\t')
sputnikdata1.head()
# What do we want to change here?

In [None]:
# The first line of the file is data, not a header
sputnikdata = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Oxford Text Analysis 22/Python QTA Drive Oxford 2022/Data/sputnikgerman20.tsv',sep='\t', header=None)
sputnikdata
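
The effect of `header=None` is easier to see on a tiny example. The sketch below uses a hypothetical two-row, tab-separated sample (read from memory with `io.StringIO`) in place of the real Sputnik file, so the URLs and titles in it are made up.

```python
import io
import pandas as pd

# Hypothetical two-row, tab-separated sample standing in for the real file
sample = (
    "http://a.example\t2020-01-01\tErster Titel\n"
    "http://b.example\t2020-01-02\tZweiter Titel\n"
)

# Default: the first row is consumed as the header, so only one data row remains
with_header = pd.read_csv(io.StringIO(sample), sep="\t")

# header=None: every row is data and the columns are numbered 0, 1, 2
no_header = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

print(with_header.shape)        # (1, 3)
print(no_header.shape)          # (2, 3)
print(list(no_header.columns))  # [0, 1, 2]
```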

## Select Column

### How would I select the date?

In [None]:
sputnikdata_date=sputnikdata[2]
sputnikdata_date

### Select only content and title


In [None]:
sputnikdata_content_title=sputnikdata[[3,4]]
sputnikdata_content_title

## Select rows

In [None]:
# Select the first five rows (positions 0 to 4; the end of a slice is excluded)

sputnikdata_rows = sputnikdata[0:5]
sputnikdata_rows
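
Row slices exclude their end position, which is easy to get wrong. A minimal sketch on a hypothetical ten-row frame (`toy` is not part of the course data):

```python
import pandas as pd

# Hypothetical toy frame with ten rows
toy = pd.DataFrame({"value": range(10)})

first_row = toy[0:1]       # only the row at position 0
first_five = toy[0:5]      # positions 0, 1, 2, 3, 4
same_five = toy.iloc[0:5]  # explicit, position-based equivalent

print(len(first_row), len(first_five))  # 1 5
```

`.iloc` states the intent (select by position) more explicitly than plain square brackets, which pandas also uses for column selection.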

## Exporting data from Pandas


In [None]:
df.to_csv("mydata.csv", sep='\t', encoding='utf-8')

## Combining Datasets

We look at `pd.merge` in particular. For an in-depth explanation, see: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

In [None]:
import pandas as pd 
import numpy as np

df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Oxford Text Analysis 22/Python QTA Drive Oxford 2022/Data/sputnikgerman20.tsv',sep='\t',header=None)
df

In [None]:

# What if I want to rename the columns?

col_name_dict =  {
    0: 'link1',
    1: 'link2',
    2: 'date',
    3: 'title',
    4: 'text',
    5: 'marker',
    6: 'keywords',
    7: 'numberwords',
    8: 'score'
}
df = df.rename(col_name_dict, axis=1)
df


In [None]:
df1 = df.loc[:, ['link1', 'link2', 'title']]
df2 = df.loc[:, ['date', 'text','link1']]

In [None]:
df1

In [None]:
df2

In [None]:
print(df1.shape)
print(df2.shape)
print(df.shape)


### `pd.merge`

In [None]:
pd.merge(df1, df2, on="link1")
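
What `on="link1"` does is easier to follow on small data. Below is a minimal sketch with two hypothetical frames (`left` and `right`, with made-up values) that share only some keys. By default `pd.merge` performs an inner join; `how="left"` keeps every row of the first frame.

```python
import pandas as pd

# Hypothetical frames that share the key column "link1"
left = pd.DataFrame({"link1": ["a", "b", "c"], "title": ["T1", "T2", "T3"]})
right = pd.DataFrame({"link1": ["b", "c", "d"],
                      "date": ["2020-01-02", "2020-01-03", "2020-01-04"]})

# Default how="inner": only keys present in both frames survive
inner = pd.merge(left, right, on="link1")

# how="left": all rows of `left` are kept; missing matches become NaN
left_join = pd.merge(left, right, on="link1", how="left")

print(inner)
print(left_join)
```

If the key column contains duplicates, `pd.merge` returns every matching pair, so the result can have more rows than either input. Comparing `.shape` before and after the merge is a quick check.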

# Visualization through counting

Let's pre-process the text.

In [None]:
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # you can also download all NLTK resources at once
nltk.download('punkt')

# string.punctuation only covers ASCII; typographic characters such as ” are not in this set
exclude = set(string.punctuation)
stop_word_list = stopwords.words('german')


def nlp_simple_pipeline(text):
    # Lowercase first: the stop word list is lowercase, so "Der" would otherwise slip through
    text = text.lower()
    # Split on whitespace
    text = text.split()
    # Keep purely alphabetic tokens; a token with attached punctuation ("berlin,") is dropped
    text = [token for token in text if token not in exclude and token.isalpha()]
    # Remove German stop words
    text = [token for token in text if token not in stop_word_list]

    return text


In [None]:
dataframe = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Oxford Text Analysis 22/Python QTA Drive Oxford 2022/Data/sputnikgerman20.tsv',sep='\t',header=None)
dataframe

In [None]:
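# Column 3 holds the titles; each cleaned title becomes a list of tokens
dataframe["cleanedtitle"] = dataframe[3].apply(nlp_simple_pipeline)
dataframe

In [None]:
# Join the tokens back into one string per title
dataframe['text'] = dataframe['cleanedtitle'].apply(lambda x: ' '.join(x))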


In [None]:
dataframe

In [None]:
from collections import Counter
counts= Counter(" ".join(dataframe["text"]).split()).most_common(10)
counts
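
The counting step can be tried without the Sputnik file. In this minimal sketch, the list `titles` is a hypothetical stand-in for the cleaned `text` column.

```python
from collections import Counter

# Hypothetical cleaned titles, standing in for the "text" column
titles = ["berlin wahl ergebnis", "wahl berlin", "wahl umfrage"]

# Join all titles into one string, split it into tokens, then count them
tokens = " ".join(titles).split()
counts = Counter(tokens).most_common(2)

print(counts)  # [('wahl', 3), ('berlin', 2)]
```

`most_common(n)` returns a list of `(token, count)` tuples, which is why it can be passed straight to `pd.DataFrame` with two column names.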

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

pd.DataFrame(counts, #this is what we want to visualize
             columns=['text','val']).set_index('text').plot(kind='bar');



# **How does Webscraping work?**

1. Send a request to download the site’s content.
2. Filter the page’s HTML to look for the needed tags.
3. Print the text inside the target tags, producing the output in the format previously specified in the code.
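
The three steps can be sketched offline. This is a minimal illustration, assuming the BeautifulSoup library introduced later in this session is installed; step 1 is replaced by a hypothetical inline HTML string so that no network access is needed.

```python
from bs4 import BeautifulSoup

# Step 1 stand-in: a hypothetical HTML snippet instead of a downloaded page
html = """
<html><body>
  <h1>Books</h1>
  <a href="catalogue/page-1.html">First page</a>
  <a href="catalogue/page-2.html">Second page</a>
</body></html>
"""

# Step 2: parse the HTML and keep only the <a> tags that have an href
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", href=True)

# Step 3: collect the text and target of each tag, then print them
results = [(link.get_text(strip=True), link["href"]) for link in links]
for text, href in results:
    print(text, href)
```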

## Requests

We use the `requests` library to make HTTP requests and retrieve webpages.

In [None]:
import requests


You need to send an HTTP request to the server of the page you want to scrape. The server responds with the HTML content of the page.

In [None]:
url = "http://books.toscrape.com"

session = requests.Session()
page = session.get(url)
print(page.text)

## BeautifulSoup

BeautifulSoup contains tools for navigating `html` code.

"Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information. It is a tool for web scraping that helps you clean up and parse the documents you have pulled down from the web."

Read more here: https://programminghistorian.org/en/lessons/retired/intro-to-beautiful-soup

**HTML: standard markup language for documents designed to be displayed in a web browser.**

In [None]:
from bs4 import BeautifulSoup

In [None]:
url = "http://books.toscrape.com"

page = session.get(url)
print(page.text)

In [None]:
#we use the html parser


soup = BeautifulSoup(page.text, 'html.parser')

In [None]:
# We print content of the document
print(soup.prettify())

In [None]:
# We want to get all the links on the page
# find_all('a', href=True) returns every <a> tag that has an href attribute

links = soup.find_all('a', href=True)

links

In [None]:
# Let's use a loop to print the text and the href of each <a> tag

for link in links:
    print(link.text, link['href'])

In [None]:
type(links)

In [None]:
# That output is annoying, so let's use some regex to remove whitespace
import re
for link in links:
    print(link['href'], re.sub(r"\s", "", link.text))

    # Signature: re.sub(pattern, repl, string, count=0, flags=0)
    # \s matches any whitespace character
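
Note that replacing every `\s` match also removes the spaces between words. If readable labels are wanted, one possible alternative (a sketch, not the approach used in this notebook) is to collapse each run of whitespace to a single space and strip the ends. The string `label` below is a hypothetical link text.

```python
import re

# Hypothetical link text with surrounding newlines and indentation
label = "\n   A Light in the Attic\n  "

no_whitespace = re.sub(r"\s", "", label)        # removes the spaces between words too
collapsed = re.sub(r"\s+", " ", label).strip()  # keeps one space between words

print(no_whitespace)  # ALightintheAttic
print(collapsed)      # A Light in the Attic
```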

In [None]:
# It'd be easier to read in a table, so let's use pandas!


import pandas as pd

links_df = pd.DataFrame(
    data={
        "Label": [re.sub(re.compile(r"\s"), "", link.text) for link in links],
        "Link": [link['href'] for link in links]
    }
)
links_df

*`Credits: Musashi Harukawa Webscraping Teaching Material 2020`*