# XPath
- XPath is a query language for selecting specific elements, attributes, or text in an XML or HTML document, such as a web page
- In this notebook, XPath is used to collect data from web pages
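
As a minimal sketch of what such a query looks like, the snippet below runs two XPath expressions against a small made-up HTML string with `lxml` (the library used later in this notebook). The HTML content and the variable names are illustrative assumptions, not taken from a real page.

```python
from lxml import html

# A tiny, made-up HTML document (hypothetical content for illustration only)
page = html.fromstring("""
<html><body>
  <h3 class="item"><a href="/title/1/">First Movie</a> <span>(1999)</span></h3>
  <h3 class="item"><a href="/title/2/">Second Movie</a> <span>(2005)</span></h3>
</body></html>
""")

# //h3/a/text() selects the text of every <a> directly inside an <h3>
titles = page.xpath("//h3/a/text()")
# //h3/a/@href selects the href attribute of the same links
links = page.xpath("//h3/a/@href")

print(titles)
print(links)
```

Note that `xpath()` always returns a list, even when only one node matches.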


# XPath Cheat Sheets
- Go to http://ricostacruz.com/cheatsheets/xpath.html

### Additional resources

We just finished the first example using XPath for data collection, so you should now be more familiar with this query language. More resources can be found here:
- http://www.w3schools.com/xml/xpath_syntax.asp
- http://www.w3schools.com/xml/xpath_axes.asp
- http://www.w3schools.com/xml/xpath_operators.asp
- Finally, test your XPath skills: http://www.w3schools.com/xml/xpath_operators.asp

# Example: IMDb
- Use Google Chrome for the following:
- Go to http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=1950,2012 (1st page of Most Voted Feature Films Released 1950-2012)
- We will inspect **the page source** to work out the **XPath** for the information we're looking for. **Right-click** the page and select **Inspect**. This opens Chrome DevTools and shows the page's HTML.

In [None]:
# import python packages
import requests
from lxml import html
import csv
import pandas as pd

r = requests.get('http://www.imdb.com/search/title?at=0&sort=num_votes,desc&start=1&title_type=feature&year=1950,2012')
data = html.fromstring(r.text)

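# XPath: each <h3 class="lister-item-header"> holds one movie
alldata = []

for i in data.xpath("//h3[@class='lister-item-header']"):
    title = i.xpath('a/text()')         # movie title (link text)
    url = i.xpath('a/@href')            # relative link to the movie page
    year = i.xpath('span[2]/text()')    # release year, still in parentheses
    print(title, url, year)             # print() is a function in Python 3
    alldata.append([title, url, year])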
    
len(alldata)

In [None]:
# convert list to data frame (excel-like)
df = pd.DataFrame(alldata)
df.head(2)

## Important!!!
## Data cleaning & transformation
- This is considered one of the most important steps in data analytics. You may spend 60% to 70% of your time on data cleaning & transformation 
- You will often need to rely on **Stack Overflow** (stackoverflow.com)
- Also you should check **Working with Text Data in Pandas** https://pandas.pydata.org/pandas-docs/stable/text.html

In [None]:
# remove brackets: each cell holds a list, and .str[0] keeps only its first element

df[0]=df[0].str[0]
df[1]=df[1].str[0]
df[2]=df[2].str[0]
df.head(2)
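
Each `xpath()` call returns a list, so every cell in the data frame holds a list. The sketch below uses made-up rows to show what `.str[0]` does in that situation, including the empty-list case that appears when an XPath matches nothing. The name `demo` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows shaped like the scraper output: every cell is a list
rows = [
    [["First Movie"], ["/title/1/"], ["(1999)"]],
    [["Second Movie"], ["/title/2/"], []],  # the year XPath matched nothing
]
demo = pd.DataFrame(rows)

# .str[0] takes the first element of each list; an empty list becomes NaN
for col in demo.columns:
    demo[col] = demo[col].str[0]

print(demo)
```

Checking for missing values afterwards (for example with `demo.isna().sum()`) is a quick way to spot XPath expressions that failed on some rows.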

In [None]:
df[1] = 'http://www.imdb.com' + df[1].astype(str)
df.head(2)
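
String concatenation works here because every scraped link is relative and starts with `/`. As a hedged alternative, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute. The sample links below are made up.

```python
from urllib.parse import urljoin

base = "http://www.imdb.com"
# Hypothetical scraped links: one relative, one already absolute
links = ["/title/tt0000001/", "http://www.imdb.com/title/tt0000002/"]

# urljoin resolves relative links against the base and leaves absolute ones alone
full = [urljoin(base, link) for link in links]
print(full)
```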

## Regular expressions (regex)
- We need to remove the parentheses from the year column
- Go to https://regexr.com/ to test your regex

In [None]:
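# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df[2] = df[2].str.replace(r'\(|\)', '', regex=True)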
df.head(2)
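
The pattern can also be tried outside pandas. A minimal sketch with the standard `re` module shows what `\(|\)` removes, and that the character class `[()]` is an equivalent, shorter way to write it. The sample strings are assumptions about what the year column may contain.

```python
import re

# Made-up year strings, shaped like the scraped values
samples = ["(1994)", "(2008)", "(I) (2010)"]

# \(|\) matches an opening OR a closing parenthesis;
# [()] is a character class that matches the same two characters
with_alternation = [re.sub(r"\(|\)", "", s) for s in samples]
with_char_class = [re.sub(r"[()]", "", s) for s in samples]

print(with_alternation)
print(with_char_class)
```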

In [None]:
df.to_csv("data/imdb.csv", index=False, encoding='utf-8')

# OpenCorporates (The Open Database of the Corporate World)

- We're interested in businesses in Kansas. The URL is https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=1&q=
- Collect business names, URLs, and addresses.
- This is the second web page https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=2&q=

#### For more XPath examples, go to http://ricostacruz.com/cheatsheets/xpath.html

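- **contains()**
    - div[contains(@class, "head")]
- **starts-with()**
    - div[starts-with(@class, "head")]
- **ends-with()** (XPath 2.0 only; lxml implements XPath 1.0, so this function is not available in the Python code used here)
    - div[ends-with(@class, "head")]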
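
The sketch below tries these functions with `lxml` on made-up markup. Because lxml only implements XPath 1.0, `ends-with()` is replaced by an equivalent built from `substring()` and `string-length()`. The class names and variable names are illustrative assumptions.

```python
from lxml import html

# Made-up markup with three class names (illustrative only)
doc = html.fromstring("""
<div>
  <div class="header-main">A</div>
  <div class="page-head">B</div>
  <div class="footer">C</div>
</div>
""")

# contains(): "head" appears anywhere in the class attribute
has_head = doc.xpath('//div[contains(@class, "head")]/text()')

# starts-with(): the class attribute begins with "head"
starts_head = doc.xpath('//div[starts-with(@class, "head")]/text()')

# XPath 1.0 stand-in for ends-with(): compare the last 4 characters with "head"
ends_head = doc.xpath(
    '//div[substring(@class, string-length(@class) - string-length("head") + 1)'
    ' = "head"]/text()'
)

print(has_head, starts_head, ends_head)
```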

In [None]:
r = requests.get('https://opencorporates.com/companies/us_ks?current_status=Active+And+In+Good+Standing&page=1&q=')
data = html.fromstring(r.text)

# XPath: each search result is an <li> whose class contains "search-result"

for i in data.xpath("//li[contains(@class,'search-result')]"):
    title = i.xpath('a[2]/text()')
    url = i.xpath('span[@class="address"]/a/@href')
    address = i.xpath('span[@class="address"]/text()')
    print(title, url, address)

# This version is NOT ideal: "//li" matches every list item on the page,
# not only the search results
#for i in data.xpath("//li"):
#    title = i.xpath('a[2]/text()')
#    url = i.xpath('span[@class="address"]/a/@href')
#    address = i.xpath('span[@class="address"]/text()')
#    print(title, url, address)
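
The two page links given for this example differ only in the `page` parameter, so the URLs for further pages can be generated rather than typed. This is a minimal sketch; the helper name `page_urls` is hypothetical, and the number of pages to fetch is an assumption you would set yourself.

```python
# URL template taken from the two page links in this section; only the page number changes
BASE = ("https://opencorporates.com/companies/us_ks"
        "?current_status=Active+And+In+Good+Standing&page={page}&q=")

def page_urls(last_page):
    """Return the search URLs for pages 1..last_page (hypothetical helper)."""
    return [BASE.format(page=n) for n in range(1, last_page + 1)]

urls = page_urls(2)
for u in urls:
    print(u)
```

Each URL could then be passed to `requests.get()` inside a loop, ideally with a short `time.sleep()` between requests so the site is not flooded.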

# UK Parliament & Descriptive Analytics

- Let's find out who sits in the UK Parliament. The URL is http://www.parliament.uk/mps-lords-and-offices/mps/
- Collect MP names, URLs, and constituency (district) information.
- **Pay special attention to the XPath for political party**

In [None]:
r = requests.get('http://www.parliament.uk/mps-lords-and-offices/mps/')
data = html.fromstring(r.text)

# Xpath









# Rotten Tomatoes Movie Reviews

- Now that we're familiar with how XPath works, we will write the code without using Google Sheets.
- Go to https://www.rottentomatoes.com/m/interstellar_2014/reviews/?page=1&sort=
- Collect the reviewer name, fresh/rotten rating, review text, and date.
- There are 15 more pages of reviews for this movie.

In [None]:
r = requests.get('https://www.rottentomatoes.com/m/interstellar_2014/reviews/?page=1&sort=')
data = html.fromstring(r.text)

for i in data.xpath("//div[@class='row review_table_row']"):
    name = i.xpath('div/div/a[contains(@href, "critic")]/text()')
    sentiment = i.xpath('div[@class="col-xs-16 review_container"]/div[1]/@class')
    date = i.xpath('div[@class="col-xs-16 review_container"]/div[2]/div[1]/text()')
    review = i.xpath('div[@class="col-xs-16 review_container"]/div[2]/div[2]/div[1]/text()')
    print(name, sentiment, date, review)
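
Inside the loop, every path is evaluated relative to the current row, and any of them may match nothing. The sketch below shows one way to store the results row by row while tolerating missing pieces. The markup is made up and much simpler than the real page, and the helper `first` is hypothetical.

```python
from lxml import html

# Made-up markup: the second row has no critic link
doc = html.fromstring("""
<div>
  <div class="row"><a href="/critic/a">Critic A</a><p>Great film.</p></div>
  <div class="row"><p>No critic link here.</p></div>
</div>
""")

def first(values, default=""):
    """Return the first XPath result, or a default when nothing matched."""
    return values[0] if values else default

reviews = []
for row in doc.xpath('//div[@class="row"]'):
    # Paths without a leading slash are evaluated relative to the current row
    name = first(row.xpath('a[contains(@href, "critic")]/text()'))
    text = first(row.xpath('p/text()'))
    reviews.append({"name": name, "review": text})

print(reviews)
```

A list of dictionaries like `reviews` can be passed straight to `pd.DataFrame()`, which uses the keys as column names.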

# More on XPath
- There are many more XPath expressions you should be familiar with.
- Go to http://ricostacruz.com/cheatsheets/xpath.html

#### Union
- Use **|** to join two expressions (e.g., //a | //span)
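
A minimal sketch of a union query with `lxml`, run on made-up markup rather than on the page used in this section. The element contents and variable names are assumptions for illustration.

```python
from lxml import html

# Made-up markup with links, a span, and a paragraph that should be ignored
doc = html.fromstring("""
<div>
  <a href="/one">link one</a>
  <span>span one</span>
  <p>ignored paragraph</p>
  <a href="/two">link two</a>
</div>
""")

# One query selects every <a> and every <span>; the <p> is not matched
nodes = doc.xpath("//a | //span")
texts = [n.text for n in nodes]
print(texts)
```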

In [None]:
r = requests.get('http://www.pythonscraping.com/pages/page3.html')
data = html.fromstring(r.text)

In [None]:
fruit = []

#Xpath





In [None]:
# None means "no limit"; the older value -1 is no longer accepted by recent pandas versions
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(fruit)
df

#### This dataset needs much cleaning & transformation

In [None]:
# Convert each list to a string with astype(str), then remove the [ and ] characters
# https://stackoverflow.com/questions/37347725/converting-a-panda-df-list-into-a-string/37347837

df[0] = df[0].astype(str).str.replace(r'\[|\]', '', regex=True)
df[1] = df[1].astype(str).str.replace(r'\[|\]', '', regex=True)
df

Now, remove \n (newlines) from the dataset

In [None]:
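#https://stackoverflow.com/questions/37160929/how-to-remove-carriage-return-in-a-dataframe
# the r prefix marks a raw string (not a regex); the regex \\n matches a literal backslash followed by n,
# which is what astype(str) produced for each newline inside the lists

df[0] = df[0].str.replace(r'\\n', ' ', regex=True)
df[1] = df[1].str.replace(r'\\n', ' ', regex=True)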
df

In [None]:
# remove the single quotes left over from the list representation
df[0] = df[0].str.replace("'", "", regex=False)
df[1] = df[1].str.replace("'", "", regex=False)
df
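
An alternative sketch, assuming each cell is a list of strings as returned by `xpath()`: joining the list with `Series.str.join` avoids the brackets and quotes in the first place, so only whitespace is left to tidy. The sample rows and the name `demo` are made up.

```python
import pandas as pd

# Made-up cells shaped like xpath() results: lists of text fragments
rows = [
    [["\nApple\n"], ["\nA red ", "fruit\n"]],
    [["\nBanana\n"], ["\nA yellow fruit\n"]],
]
demo = pd.DataFrame(rows)

for col in demo.columns:
    demo[col] = (
        demo[col]
        .str.join(" ")                           # list of strings -> one string
        .str.replace(r"\s+", " ", regex=True)    # collapse newlines and repeated spaces
        .str.strip()                             # drop leading/trailing whitespace
    )

print(demo)
```

This only works when every list element is a string; cells containing other types come back as missing values from `str.join`.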

In [None]:
df.to_csv("data/fruit.csv", index=False, encoding='utf-8')

# References
- Working with Text Data in Pandas https://pandas.pydata.org/pandas-docs/stable/text.html