# OH GOD WHAT NOW? 
# or 
# Data Science Topics I Didn't Cover and Where To Go From Here


## API Example

Let's say we want to extract some data from an API, like the Google Maps API. You will need several components for this process to work:

1. A library that can send requests to the API and handle the responses (we will use `requests`)
2. A way to convert the data we get into a dictionary (since the JSON response itself arrives as a string)
3. A way to convert the JSON object into a dataframe
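
As a rough sketch of steps 2 and 3, the snippet below parses a JSON string with the standard-library `json` module and flattens the nested result into a single row. The sample payload and the `flatten` helper are hypothetical; the helper is only there to illustrate the kind of dotted column names a normalizing function produces, and is not how pandas does it internally.

```python
import json

# A JSON response arrives as text; this sample mimics a typical nested payload.
raw = '{"iss_position": {"latitude": "-14.2017", "longitude": "116.2056"}, "timestamp": 1557625551, "message": "success"}'

# Step 2: parse the string into a Python dictionary.
parsed = json.loads(raw)

def flatten(record, parent_key="", sep="."):
    """Hypothetical helper: collapse nested dicts into one flat dict."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# Step 3: one flat dict per record is the shape a dataframe row needs.
row = flatten(parsed)
print(row)
```

A list of such flat dictionaries can be handed straight to `pd.DataFrame(...)` to get one row per record.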

Let's start by importing what we need:

In [1]:
import requests
from pandas import json_normalize  # pandas >= 1.0; older versions imported it from pandas.io.json

The example I will show you involves an API that doesn't require any keys (most APIs require some authentication, but the ISS position API does not).
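
For APIs that do need a key, the key usually travels either as a query parameter or in a request header. A minimal sketch, assuming a made-up endpoint and key (`https://api.example.com/v1/geocode` and `YOUR_API_KEY` are placeholders, not a real service):

```python
from urllib.parse import urlencode

# Hypothetical endpoint and key, for illustration only.
base_url = "https://api.example.com/v1/geocode"
api_key = "YOUR_API_KEY"

# Option 1: send the key as a query parameter.
params = {"address": "Boston, MA", "key": api_key}
full_url = f"{base_url}?{urlencode(params)}"

# Option 2: send the key in a request header instead.
headers = {"Authorization": f"Bearer {api_key}"}

print(full_url)
```

With `requests`, the same dictionaries can be passed directly, as in `requests.get(base_url, params=params, headers=headers)`. Which option applies depends on the API's own documentation.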

This API simply returns the current latitude/longitude coordinates of the International Space Station:

In [13]:
url = "http://api.open-notify.org/iss-now.json"
my_r = requests.get(url)

In [14]:
my_r.json()

{'iss_position': {'latitude': '-14.2017', 'longitude': '116.2056'},
 'timestamp': 1557625551,
 'message': 'success'}

In [15]:
my_r.json().keys()

dict_keys(['iss_position', 'timestamp', 'message'])

In [16]:
my_r.json()["iss_position"]

{'latitude': '-14.2017', 'longitude': '116.2056'}

In [17]:
json_normalize(my_r.json()["iss_position"])

|  | latitude | longitude |
|---|---|---|
| 0 | -14.2017 | 116.2056 |
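
Note that the coordinates come back as strings, and the timestamp looks like seconds since the Unix epoch. A small sketch of converting them to more useful types, using the response values shown above (the variable names are illustrative, and treating the timestamp as UTC epoch seconds is an assumption):

```python
from datetime import datetime, timezone

# Values copied from the example response above.
response = {
    "iss_position": {"latitude": "-14.2017", "longitude": "116.2056"},
    "timestamp": 1557625551,
    "message": "success",
}

# Latitude and longitude arrive as strings, so cast them to floats.
latitude = float(response["iss_position"]["latitude"])
longitude = float(response["iss_position"]["longitude"])

# Assumption: the timestamp is seconds since the Unix epoch, read as UTC.
observed_at = datetime.fromtimestamp(response["timestamp"], tz=timezone.utc)

print(latitude, longitude, observed_at.isoformat())
```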


## Basic Scraping Using BeautifulSoup

BeautifulSoup is a web scraping library that lets you parse webpages and extract useful data from HTML (the markup language in which webpages are written).

Let's use `BeautifulSoup` on an example website (the Google home page):

In [18]:
from bs4 import BeautifulSoup

First, we request the webpage and extract the content:

In [19]:
a_website = requests.get("http://google.com")
data = BeautifulSoup(a_website.content, "html5lib")

Let's take a look at what our data is:

In [20]:
data

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for." name="description"/><meta content="noodp" name="robots"/><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/><title>Google</title><script nonce="ogJjaanagrbnOkSuhB9uPg==">(function(){window.google={kEI:'33rXXLfzIeHB_Qaw6J7IDQ',kEXPI:'0,1353747,57,1957,2423,698,527,591,139,224,509,19,1047,1257,824,1070,58,320,207,145,872,535,70,31,69,338,2332552,329529,1294,12383,4855,22723,9969,15247,867,12163,5281,1100,3335,2,2,6801,363,3320,1262,4243,224,2212,266,4203,906,573,835,284,2,579,727,2432,58,2,1,3,1297,4323,3700,1267,774,2250,1407,3337,1146,5,2,2,1963,2595,3601,669,1050,1808,1397,81,7,1,2,488,620,29,1395,978,2632,4138,1161

In [21]:
link_list = data.find_all('a')

The data is the raw HTML of the page we just requested. Now let's extract specific things from that HTML, like all of the other webpages it links to:

In [22]:
for link in data.find_all('a'):
    print(link.get('href'))

http://www.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=US&tab=w1
http://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.com/
/advanced_search?hl=en&authuser=0
/language_tools?hl=en&authuser=0
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/
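
Several of the links above are relative (they start with `/`). A minimal sketch of turning them into absolute URLs with the standard library, assuming the page was served from `http://google.com` (the real base may differ after redirects):

```python
from urllib.parse import urljoin

base_url = "http://google.com"  # assumed base for the page we scraped

# A few hrefs taken from the output above; some absolute, some relative.
hrefs = [
    "https://drive.google.com/?tab=wo",
    "/preferences?hl=en",
    "/intl/en/policies/privacy/",
]

# urljoin leaves absolute URLs alone and resolves relative ones against the base.
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```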


In [23]:
data.body.find_all("a")

[<a class="gb1" href="http://www.google.com/imghp?hl=en&amp;tab=wi">Images</a>,
 <a class="gb1" href="http://maps.google.com/maps?hl=en&amp;tab=wl">Maps</a>,
 <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=w8">Play</a>,
 <a class="gb1" href="http://www.youtube.com/?gl=US&amp;tab=w1">YouTube</a>,
 <a class="gb1" href="http://news.google.com/nwshp?hl=en&amp;tab=wn">News</a>,
 <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>,
 <a class="gb1" href="https://drive.google.com/?tab=wo">Drive</a>,
 <a class="gb1" href="https://www.google.com/intl/en/about/products?tab=wh" style="text-decoration:none"><u>More</u> »</a>,
 <a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>,
 <a class="gb4" href="/preferences?hl=en">Settings</a>,
 <a class="gb4" href="https://accounts.google.com/ServiceLogin?hl=en&amp;passive=true&amp;continue=http://www.google.com/" id="gb_70" target="_top">Sign in</a>,
 <a href="/advanced_search?hl=en&amp;authuser=0">
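
`find_all` can also filter on attributes. A small sketch on an inline HTML snippet (trimmed from the output above), assuming `bs4` is installed. It uses Python's built-in `html.parser`, so no extra parser package is needed:

```python
from bs4 import BeautifulSoup

# A trimmed-down snippet based on the anchors shown above.
html = """
<body>
  <a class="gb1" href="http://maps.google.com/maps?hl=en&amp;tab=wl">Maps</a>
  <a class="gb1" href="https://mail.google.com/mail/?tab=wm">Gmail</a>
  <a class="gb4" href="/preferences?hl=en">Settings</a>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only anchors with class "gb1" and pair each link's text with its href.
nav_links = {a.get_text(): a.get("href") for a in soup.find_all("a", class_="gb1")}

print(nav_links)
```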

## Recommendation Engines (or Recommendation Systems)

![recommendation engine example](./images/recommendation_engine.png)

### Problem: How can we recommend the "most useful" or "most likely to be consumed and enjoyed" items (books/movies/songs/media/clothing/investment products/etc.) to people?

### 2 Basic Types of Recommendation Engines:

### **Content-based:** Use the properties of items to recommend "similar" items (PANDORA DOES THIS)

### **Collaborative:** Use the user's ratings on items and the ratings of others on those same items to recommend other items that "similar" users have also given high ratings (AMAZON/NETFLIX DO THIS)
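
To make the two approaches concrete, here is a toy sketch in plain Python. The items, feature vectors, users, and ratings are all invented for illustration, and cosine similarity is just one common choice of similarity measure. Treating "not rated" as 0 is a simplification that real systems handle more carefully.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Content-based: describe each item by its properties (made-up features:
# [is_rock, is_jazz, has_vocals]) and recommend the most similar item.
item_features = {
    "song_a": [1, 0, 1],
    "song_b": [1, 0, 0],
    "song_c": [0, 1, 1],
}

def most_similar_item(liked):
    others = [name for name in item_features if name != liked]
    return max(others, key=lambda name: cosine(item_features[liked], item_features[name]))

# Collaborative: compare users by their ratings (0 means "not rated") and
# recommend what the most similar user rated that the target has not.
items = ["book_1", "book_2", "book_3", "book_4"]
ratings = {
    "ana":  [5, 4, 0, 0],
    "ben":  [5, 5, 0, 4],
    "cara": [1, 0, 5, 2],
}

def recommend_for(user):
    others = [name for name in ratings if name != user]
    neighbor = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    unseen = [i for i, r in enumerate(ratings[user]) if r == 0 and ratings[neighbor][i] > 0]
    best = max(unseen, key=lambda i: ratings[neighbor][i])
    return items[best]

print(most_similar_item("song_a"))
print(recommend_for("ana"))
```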

### Are there Python libraries for this?

### [Turi](https://turi.com/index.html) has a really easy-to-use recommendation engine module, but it's NOT FREE (BOO) AND WAS ALSO RECENTLY ACQUIRED BY APPLE
### In general, recommenders are built at scale using other languages (Python has pretty terrible support for recommendation engines/systems)

## Deep Learning

### What is it?

### A really fancy supervised classification/regression method

![artificial neural network example](./images/artificial_neural_network.png)
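
To give a feel for what a network like the one pictured computes, here is a tiny forward pass in plain Python. The weights are hand-picked for illustration rather than learned (a real deep learning library would find them by training), and the task (XOR) is chosen because no single linear model can solve it, while one hidden layer can.

```python
def relu(x):
    """Rectified linear unit: a common hidden-layer activation."""
    return max(0.0, x)

def tiny_network(x1, x2):
    # Hidden layer: two neurons, each a weighted sum plus bias, then ReLU.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output layer: a weighted sum of the hidden activations.
    return 1.0 * h1 - 2.0 * h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", tiny_network(x1, x2))
```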

### What is it used on? Datasets that are really really big that require learning really really complicated representations:

### Pictures/Videos, Text:

![facial recognition deep learning example](./images/calista-deepface.png)

### What is it used for? Object detection and labeling, classifying and rating text (sentiment analysis), and low-level content "understanding"

### Are there Python libraries for this?

### [tensorflow](https://www.tensorflow.org) is Google's new deep learning package. It just came out, but everyone is going bonkers over it because GOOGLE.
### [theano](http://deeplearning.net/software/theano/) is much more established and works very well with both your CPU and your graphics card (whaaaaat??)
### [keras](http://keras.io) is another Python library that you can use to build deep learning models and hook into either theano or tensorflow.

## Natural Language Processing

### How can I use machine learning when I have freeform text instead of a matrix with rows and columns? How can I cluster texts? How can I try to model useful things from text generally?

### Are there Python libraries for this?
### [nltk](http://www.nltk.org/) is the granddaddy NLP library in Python, but it's kinda fallen behind a bit
### [spacy](https://spacy.io/) is a newer, more modern NLP library
### [gensim](https://radimrehurek.com/gensim/index.html) is a very cool new library that allows you to learn word vector representations from text (so you can create text topics and other really cool stuff)

You can even do some basic NLP transformations in Scikit-learn:

We will go over 2 simple transformations to convert text into vectors:

1. **CountVectorizer**: Really large-scale "One-hot-encoding" across all tokens in the text (except that each cell holds the number of times a token occurs, not just 0 or 1)
2. **TF-IDF**: Term Frequency/Inverse Document Frequency - weight each document's tokens by how well they distinguish it. Tokens that appear in this document but rarely in the other documents in your dataset get high values; tokens that appear frequently across all documents get low values.

**Both of these approaches ignore word order and context (beyond whatever multi-word tokens capture) and simply convert text into a dictionary of values.**

A token can be anything: a word, a character, a contiguous group of 2, 3, or 4 words, etc.
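
A quick sketch of what "a token can be anything" means in practice. The two helpers below are hypothetical illustrations, not scikit-learn's implementation:

```python
def word_ngrams(text, n):
    """Contiguous groups of n words, each joined back into one token."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Contiguous groups of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "yowza thats awesome"

print(word_ngrams(sentence, 1))  # single words
print(word_ngrams(sentence, 2))  # pairs of adjacent words
print(char_ngrams("yowza", 3))   # character trigrams
```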

In [24]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

In [25]:
some_text = pd.DataFrame([["yowza thats awesome"],["Yowza yowza that is an awesome sauce"]],columns = ["text"])

In [26]:
some_text

|  | text |
|---|---|
| 0 | yowza thats awesome |
| 1 | Yowza yowza that is an awesome sauce |


## Count Vectorizer

Think of it as large-scale one-hot encoding with some catches:

In [27]:
vectorizer = CountVectorizer(ngram_range=(1, 6))  # optionally add stop_words="english"
transformed_text = vectorizer.fit_transform(some_text.text)

In [28]:
pd.DataFrame(transformed_text.toarray(), columns=vectorizer.get_feature_names_out())

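|  | an | an awesome | an awesome sauce | awesome | awesome sauce | is | is an | is an awesome | is an awesome sauce | sauce | ... | yowza that is an | yowza that is an awesome | yowza that is an awesome sauce | yowza thats | yowza thats awesome | yowza yowza | yowza yowza that | yowza yowza that is | yowza yowza that is an | yowza yowza that is an awesome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |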


In [29]:
vectorizer.get_feature_names_out().tolist()

['an',
 'an awesome',
 'an awesome sauce',
 'awesome',
 'awesome sauce',
 'is',
 'is an',
 'is an awesome',
 'is an awesome sauce',
 'sauce',
 'that',
 'that is',
 'that is an',
 'that is an awesome',
 'that is an awesome sauce',
 'thats',
 'thats awesome',
 'yowza',
 'yowza that',
 'yowza that is',
 'yowza that is an',
 'yowza that is an awesome',
 'yowza that is an awesome sauce',
 'yowza thats',
 'yowza thats awesome',
 'yowza yowza',
 'yowza yowza that',
 'yowza yowza that is',
 'yowza yowza that is an',
 'yowza yowza that is an awesome']
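
The "catches" are easier to see by rebuilding the counts by hand. This sketch assumes `CountVectorizer`'s default behaviour: text is lowercased, tokens are runs of two or more word characters, and each cell holds a count rather than a 0/1 flag. It is an approximation for illustration, not the library's code, and `count_ngrams` is a made-up helper name.

```python
import re
from collections import Counter

docs = ["yowza thats awesome", "Yowza yowza that is an awesome sauce"]

def count_ngrams(text, max_n=6):
    # Lowercase, then keep runs of 2+ word characters (single letters are dropped).
    words = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

doc_counts = [count_ngrams(doc) for doc in docs]

# The vocabulary is the sorted set of every n-gram seen in any document.
vocabulary = sorted(set().union(*doc_counts))

print(len(vocabulary))          # number of columns
print(doc_counts[1]["yowza"])   # a count, not a 0/1 flag
print(doc_counts[0]["sauce"])   # absent n-grams count as 0
```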

## TfidfVectorizer

- This computes how often a term appears in a document relative to how often it appears across all documents
- Much more useful than raw "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents). A term that appears across all documents is incredibly uninformative; a term that appears in only a few documents is far more informative.
- Commonly used for search engine scoring, text summarization, and document clustering until ~2013-2014.

In [30]:
vect = TfidfVectorizer(ngram_range=(1, 6))
pd.DataFrame(vect.fit_transform(some_text.text).toarray(), columns=vect.get_feature_names_out())

|  | an | an awesome | an awesome sauce | awesome | awesome sauce | is | is an | is an awesome | is an awesome sauce | sauce | ... | yowza that is an | yowza that is an awesome | yowza that is an awesome sauce | yowza thats | yowza thats awesome | yowza yowza | yowza yowza that | yowza yowza that is | yowza yowza that is an | yowza yowza that is an awesome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.3178 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.446656 | 0.446656 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.194143 | 0.194143 | 0.194143 | 0.138134 | 0.194143 | 0.194143 | 0.194143 | 0.194143 | 0.194143 | 0.194143 | ... | 0.194143 | 0.194143 | 0.194143 | 0.0 | 0.0 | 0.194143 | 0.194143 | 0.194143 | 0.194143 | 0.194143 |
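
The values in this table can be reproduced by hand, which shows what the vectorizer is doing. The sketch below assumes scikit-learn's default settings: a smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit (L2) length. The helper names are illustrative, not part of any library.

```python
import math
import re
from collections import Counter

docs = ["yowza thats awesome", "Yowza yowza that is an awesome sauce"]

def count_ngrams(text, max_n=6):
    """Count every word n-gram (n = 1..max_n) in lowercased text."""
    words = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    )

doc_counts = [count_ngrams(doc) for doc in docs]
n_docs = len(docs)

# Document frequency: in how many documents does each n-gram appear?
doc_freq = Counter(term for counts in doc_counts for term in counts)

def tfidf(counts):
    # Weight each raw count by its smoothed inverse document frequency...
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    # ...then rescale so the row has unit L2 norm.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf(counts) for counts in doc_counts]

print(round(rows[0]["awesome"], 4))
print(round(rows[0]["yowza thats"], 6))
print(round(rows[1]["awesome"], 6))
```

"awesome" appears in both documents, so it gets the lowest idf and ends up with a smaller weight than n-grams that are unique to one document.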


## BIG DATA

### Problem: Data Science is great and all on my little laptop, but can it scale to billions/trillions/septillions of examples?

## YES! JUST ADD BIG DATA

![big data](./images/big_data.jpg)

### What this really means is: use APIs that can scale across many, many interconnected computers (servers)

### Can you BIG DATA in Python?

### [SPARK](http://spark.apache.org) has a Python API. It's a big data framework (currently the hottest big data framework) that takes many of the algorithms and approaches we learned (data analysis, exploration, etc.) and scales them to work across many computers (DataFrames are implemented in Spark, and so are Logistic Regression/Linear Regression and Random Forests)
### [H2O.ai](http://www.h2o.ai/)
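
The core idea behind frameworks like Spark can be sketched without any cluster at all: split the data into partitions, compute a small summary on each one independently, then combine the summaries. The code below is plain Python standing in for that pattern (it is not Spark's API), and the data and partition count are made up:

```python
from functools import reduce

data = list(range(1, 101))  # stand-in for a dataset too large for one machine

def partition(values, n_parts):
    """Split values into n_parts roughly equal chunks (one per 'worker')."""
    return [values[i::n_parts] for i in range(n_parts)]

def summarize(chunk):
    # Each worker only needs to return a tiny (sum, count) summary.
    return (sum(chunk), len(chunk))

def combine(a, b):
    # Summaries from different workers merge with simple addition.
    return (a[0] + b[0], a[1] + b[1])

partial_results = [summarize(chunk) for chunk in partition(data, 4)]
total, count = reduce(combine, partial_results)
mean = total / count

print(mean)
```

Because each summary is computed independently, the `summarize` step is the part a real framework would farm out to separate machines.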

## OK WHAT DO I DO NEXT?

### 1. Use what you've learned (code in Python when you want to use Excel)
### 2. Go a bit deeper into the mathematics behind some of these algorithms (you need to know a little bit of 1st year calculus and a bit more linear algebra) by [reading one of the classic texts on the topic FOR FREE](http://statweb.stanford.edu/~tibs/ElemStatLearn/)
### 3. Learn more! Check [DataTau](http://www.datatau.com) every day. Lurk on the [machine learning](https://www.reddit.com/r/MachineLearning/) and [statistics](https://www.reddit.com/r/statistics) subreddits (yes there are useful things on reddit).
### 4. Keep coding. As frequently as you can (or want).
### 5. Keep learning. Check out [metacademy](http://www.metacademy.org/roadmaps/). They have self-directed, free course road maps to LEVEL UP.
### 6. All of you are very capable of doing this. Keep going, because I personally think this stuff is really fun (and these skills will probably inform the future of every freaking industry).