![image](google-header.png)

## **Google Search Analysis**

In the previous workbook we explored getting the data from Google. Awesome! Let's take it a step further and see if we can sift through those results and find the outliers.

## **Overview**
The following cells do some data exploration, basic cleanup, and feature engineering, followed by clustering and text analysis.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import seaborn as sns
import spacy
from spacy import displacy

RAW_DATA = 'data'

### **Create NLP Model**

In [2]:
nlp = spacy.load('en_core_web_md')

### **Load Pandas DF**

In [4]:
results = pd.read_csv(Path(RAW_DATA).joinpath('search_results.tsv'), sep='\t')
results.head()

Unnamed: 0,title,link,displayLink,snippet,formattedUrl,pagemap,mime,fileFormat,cacheId
0,10 Turfway Park,https://www.espn.com/media/horse/RACE5_060325.pdf,www.espn.com,10 Turfway Park. LanesEnd-G2. 1 MILES (1:49 35...,https://www.espn.com/media/horse/RACE5_060325.pdf,"{'metatags': [{'moddate': 'D:20060323171220', ...",application/pdf,PDF/Adobe Acrobat,
1,"""i'd do anything"" official rules for applicant...",https://www.espn.com/eoe/doanything/IDA2_rules...,www.espn.com,"""I'd Do Anything"" is a reality/game show (the ...",https://www.espn.com/eoe/doanything/IDA2_rules...,"{'metatags': [{'moddate': ""D:20050217104133-08...",application/pdf,PDF/Adobe Acrobat,-ZvW4YmGicQJ
2,critique of the freeh report: the rush to inju...,https://www.espn.com/pdf/2013/0210/espn_otl_FI...,www.espn.com,KING & SPALDING: FEBRUARY 2013. WICK SOLLERS. ...,https://www.espn.com/pdf/.../espn_otl_FINAL%20...,,application/pdf,PDF/Adobe Acrobat,
3,police wrote,https://www.espn.com/pdf/2015/0614/IanWalker_t...,www.espn.com,"Jun 14, 2015 ... Age : 22. Occupation/Vocation...",https://www.espn.com/pdf/2015/0614/IanWalker_t...,"{'metatags': [{'moddate': ""D:20150613140101-04...",application/pdf,PDF/Adobe Acrobat,w3VSpEpWQfcJ
4,the police report states,https://www.espn.com/pdf/2015/0614/Perry_fight...,www.espn.com,"Jun 14, 2015 ... On April 13, 2014 at 2048 hou...",https://www.espn.com/pdf/2015/0614/Perry_fight...,{'cse_thumbnail': [{'src': 'https://encrypted-...,application/pdf,PDF/Adobe Acrobat,a0onrzRAzDkJ
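
The `pagemap` column comes back from the TSV as a string that looks like a Python dict, and some rows have no value at all. Below is a minimal sketch of one way to turn those cells into real dicts before doing any feature engineering. The helper name `parse_pagemap` and the shortened sample string are hypothetical and only there for illustration; real cells hold more keys than shown here.

```python
import ast


def parse_pagemap(value):
    """Return a pagemap cell as a dict, or {} if it is missing or unparseable."""
    # pandas reads empty cells as float NaN, so treat any non-string as missing
    if not isinstance(value, str):
        return {}
    try:
        # literal_eval only accepts Python literals, so it is safer than eval
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Hypothetical, shortened sample (real cells carry more metadata)
sample = "{'metatags': [{'moddate': 'D:20060323171220'}]}"
print(parse_pagemap(sample)['metatags'][0]['moddate'])
print(parse_pagemap(float('nan')))
```

In the notebook this could be applied with something like `results['pagemap'].apply(parse_pagemap)`, assuming the column was loaded as shown above.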


### **Interpret Snippet**

In [37]:
for snip in results['snippet'][0:4]:
    print('='*40)
    doc = nlp(snip)
    displacy.render(doc, style='ent', jupyter=True)
    print('='*40)

(10, Turfway Park, LanesEnd-G2, 1 MILES, 35th, 500,000, Three-Year-Olds, Colts, 121 lbs)


(ESPN Productions, Inc., ESPN)


(SPALDING, FEBRUARY 2013, MARK JENSEN, ALAN DIAL, DREW CRAWFORD)


(Jun 14, 22, DL/ID State, FL, FSU, United States, Min, Height, 5'11)




In [38]:
doc = nlp(results['snippet'][0])
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

10 CARDINAL
Turfway Park FAC
LanesEnd-G2 ORG
1 MILES QUANTITY
35th ORDINAL
500,000 MONEY
Three-Year-Olds DATE
Colts NORP
121 lbs QUANTITY
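
One simple feature to pull out of output like this is a count of entity labels per snippet. The sketch below tallies the labels with `collections.Counter`, using the (text, label) pairs copied from the output above so it runs without loading the spaCy model. In the notebook, the same tally could presumably be built from `doc.ents` directly. Keep in mind the labels are the model's guesses (for example `Colts` tagged as `NORP`), so the counts are only as good as the tagging.

```python
from collections import Counter

# (text, label) pairs copied from the entity output above
entities = [
    ("10", "CARDINAL"),
    ("Turfway Park", "FAC"),
    ("LanesEnd-G2", "ORG"),
    ("1 MILES", "QUANTITY"),
    ("35th", "ORDINAL"),
    ("500,000", "MONEY"),
    ("Three-Year-Olds", "DATE"),
    ("Colts", "NORP"),
    ("121 lbs", "QUANTITY"),
]

# Count how often each label occurs in this snippet
label_counts = Counter(label for _, label in entities)
print(label_counts.most_common(1))  # → [('QUANTITY', 2)]
```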
