# Data Science Skills and Examples in Python
## Basics

### Lists

### Functions

### NumPy

### Matplotlib

### Dictionaries and Pandas

### Logic & Filtering

### Loops

## Data Science Toolbox

### Create your own function

### Arguments & Scope

### Lambda Functions & Error-Handling

Lambda functions are anonymous functions, that is, functions defined without a name.
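
As a minimal sketch of the idea, the snippet below writes the same operation once with `def` and once as a lambda, then passes lambdas inline to `map` and `sorted`. The names `square`, `square_lambda`, `nums`, and `words` are illustrative only.

```python
# A named function and an equivalent lambda (names are illustrative)
def square(x):
    return x ** 2

square_lambda = lambda x: x ** 2

# Lambdas are most useful inline, as short throwaway arguments
nums = [3, 1, 2]
squares = list(map(lambda x: x ** 2, nums))
words = sorted(['pandas', 'numpy', 're'], key=lambda w: len(w))

print(square(4), square_lambda(4))
print(squares)
print(words)
```

Assigning a lambda to a name, as `square_lambda` does, is shown only for comparison; when a function needs a name, `def` is usually the clearer choice.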

### Iterators

### List Comprehensions

## Importing Data

### Flat files

### Relational Databases

### Other File Types

### URLs and Web Scraping

In [7]:
# Import packages
import requests
from bs4 import BeautifulSoup


# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML, naming the parser explicitly: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Get the title of Guido's webpage: guido_title
guido_title = soup.title

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()

# Print Guido's text to the shell
print(guido_text)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))

<title>Guido's Personal Home Page</title>


Guido's Personal Home Page




Guido van Rossum - Personal Home Page
"Gawky and proud of it."
Who
I Am
Read
my "King's
Day Speech" for some inspiration.

I am the author of the Python
programming language.  See also my resume
and my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some
pictures of me,
my new blog, and
my old
blog on Artima.com.  I am
@gvanrossum on Twitter.  I
also have
a G+
profile.

In January 2013 I joined
Dropbox.  I work on various Dropbox
products and have 50% for my Python work, no strings attached.
Previously, I have worked for Google, Elemental Security, Zope
Corporation, BeOpen.com, CNRI, CWI, and SARA.  (See
my resume.)  I created Python while at CWI.

How to Reach Me
You can send email for me to guido (at) python.org.
I read everything sent there, but if you ask
me a question about using Python, it's likely that I won't have time
to answer it, and will instead ref



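BeautifulSoup emits a warning when no parser is specified. To get rid of the warning, change code that looks like this:

 BeautifulSoup(YOUR_MARKUP)

to this:

 BeautifulSoup(YOUR_MARKUP, "lxml")

Any installed parser can be named here, including Python's built-in "html.parser".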


### APIs

In [4]:
import requests
import json

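# Example of another API endpoint (replace [your_api_key] with a real key):
# url = 'https://stmossberg.managerpluscloud.com/v16/api/PurchaseOrdersAndLines?api_key=[your_api_key]&$skip=40'

# Assign the URL to query: url
url = 'http://www.omdbapi.com/?apikey=ff21610b&t=social+network'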

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])

Title:  The Social Network
Year:  2010
Rated:  PG-13
Released:  01 Oct 2010
Runtime:  120 min
Genre:  Biography, Drama
Director:  David Fincher
Writer:  Aaron Sorkin (screenplay), Ben Mezrich (book)
Actors:  Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
Plot:  Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
Language:  English, French
Country:  USA
Awards:  Won 3 Oscars. Another 165 wins & 168 nominations.
Poster:  https://m.media-amazon.com/images/M/MV5BMTM2ODk0NDAwMF5BMl5BanBnXkFtZTcwNTM1MDc2Mw@@._V1_SX300.jpg
Ratings:  [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '95%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating:  7.7
imdbVotes:  550,434
imdbID:  tt1285016
Type:  movie
DVD:  11 Jan 2011
BoxOffice:  $96,40

To get data from nested JSON such as
{'batchcomplete': '',
 'query': {'normalized': [{'from': 'pizza', 'to': 'Pizza'}],
  'pages': {'24768': {'extract': '<p class="mw-empty-elt"></p>
we only need to index each nested level in turn, from the outermost key to the one we want:

pizza_extract = json_data['query']['pages']['24768']['extract']
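
The following sketch shows the same lookup end to end. It assumes a response shaped like the sample above; the `raw` string and its extract text are placeholders, not real API output.

```python
import json

# Hypothetical response text shaped like the sample above;
# the extract text is a placeholder
raw = '''
{"batchcomplete": "",
 "query": {"normalized": [{"from": "pizza", "to": "Pizza"}],
           "pages": {"24768": {"extract": "placeholder extract text"}}}}
'''
json_data = json.loads(raw)

# Index one key per nesting level
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)

# When the page ID is not known in advance, iterate over the pages instead
for page_id, page in json_data['query']['pages'].items():
    print(page_id, page.get('extract', ''))

# A list level is indexed by position rather than by key
normalized_to = json_data['query']['normalized'][0]['to']
print(normalized_to)
```

Iterating over `pages` avoids hard-coding the page ID, which may differ from one query to the next.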

### Further API Examples

In [6]:
# Import package
import tweepy

# Store OAuth authentication credentials in relevant variables
access_token = "[your_token]"
access_token_secret = "[your_ts]"
consumer_key = "[your_key]"
consumer_secret = "[your_cs]"

# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

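# Initialize Stream listener
# (MyStreamListener is not defined in this notebook; it is assumed to be defined elsewhere)
listener = MyStreamListener()

# Create your Stream object with authentication
stream = tweepy.Stream(auth, listener)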

# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['clinton', 'trump', 'sanders', 'cruz'])

"""Load saved File"""
# Import package
import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])

# Second load
# Import package
import json

# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets: tweets_data
tweets_data = []

# Open the file, then read in tweets and store them in the list: tweets_data
# (the with block closes the file automatically)
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        tweet = json.loads(line)
        tweets_data.append(tweet)

# Print the keys of the first tweet dict
print(tweets_data[0].keys())

"""Build a dataframe from the tweets"""
# Import package
import pandas as pd

# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])

# Print head of DataFrame
print(df.head())

"""Analyze Tweets"""
import re

def word_in_text(word, text):
    """Return True if word occurs in text, ignoring case."""
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)

    return bool(match)

# Initialize list to store tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]

# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])

"""Graph Results"""
# Import packages
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style
sns.set(color_codes=True)

# Create a list of labels: cd
cd = ['clinton', 'trump', 'sanders', 'cruz']

# Plot a bar chart of the counts
# x holds the labels to appear on the x-axis;
# y holds the list of values to plot
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()


## Data Munging

### Exploring Data

### Tidying Data

### Combining Data
#### Concatenate

In [1]:
# Import package
import pandas as pd

# Create an empty list: frames
frames = []

# Iterate over csv_files (a preloaded list of CSV file paths)
for csv_path in csv_files:

    # Read the CSV file into a DataFrame: df
    df = pd.read_csv(csv_path)

    # Append df to frames
    frames.append(df)

# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)

# Print the shape of uber
print(uber.shape)

# Print the head of uber
print(uber.head())


#### Merging
This example uses pre-loaded tables called `site` and `visited`.

In [2]:
# Import package
import pandas as pd

# As with SQL joins, specify the key column in each table to join on
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print o2o
print(o2o)

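
Because the pre-loaded tables are not available here, the sketch below builds small stand-ins so the merge can be run as is. Only the key columns `name` and `site` come from the example above; the other columns (`lat`, `ident`, `dated`) and all values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the pre-loaded tables; the values are made up
site = pd.DataFrame({'name': ['A-1', 'B-2', 'C-3'],
                     'lat': [-49.85, -47.15, -48.87]})
visited = pd.DataFrame({'ident': [619, 734, 837],
                        'site': ['A-1', 'B-2', 'C-3'],
                        'dated': ['2020-01-01', '2020-01-02', '2020-01-03']})

# Join rows where site['name'] equals visited['site']
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')
print(o2o)

# Both key columns are kept because their names differ
print(o2o.columns.tolist())
```

Each key value appears exactly once in both tables, so this is a one-to-one merge and the result has one row per site.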

### Cleaning Data and RegEx

Regular expressions, or RegEx for short, are special text strings that describe a search pattern.
##### Best Practice is to:
1. Compile the pattern
2. Use the compiled pattern to match search values

##### Example of RegEx Code
Characters that have a special meaning in a pattern, such as `$` or `.`, must be escaped to match them literally. Escaping also works the other way around: it gives some ordinary letters a special meaning, as in `\d`. The backslash `\` is the escape character in Python.

![image.png](attachment:image.png)
The asterisk `*` matches zero or more occurrences of the preceding expression, and it takes as many as it can.
* `\d` stands for a digit.
* `\d*` matches a run of digits of any length, including none.
* `\$\d*` matches a dollar sign followed by any number of digits.
* Curly braces around a number specify an exact repeat count. In the above example, `{2}` matches exactly two digits after the decimal point.
* The caret `^` anchors the pattern to the beginning of the value, and `$` anchors it to the end. This prevents a match on a string such as "I have $99.32 USD"; only values that fit the pattern from start to end are matched.
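
The image the bullets refer to is not available here, so the following is a sketch that assumes the pattern under discussion is `^\$\d*\.\d{2}$`, pieced together from the bullets. The sample strings and the names `anchored` and `unanchored` are made up for illustration.

```python
import re

# Assumed pattern, pieced together from the bullets above:
#   ^      start of the value
#   \$     a literal dollar sign
#   \d*    zero or more digits
#   \.     a literal decimal point
#   \d{2}  exactly two digits
#   $      end of the value
anchored = re.compile(r'^\$\d*\.\d{2}$')

# The same pattern without the anchors, for comparison
unanchored = re.compile(r'\$\d*\.\d{2}')

samples = ['$99.32', 'I have $99.32 USD', '$17.895', '17.89']
results = {s: bool(anchored.search(s)) for s in samples}
for s, matched in results.items():
    print(f'{s!r}: {matched}')

# Without anchors, the amount is found inside the longer sentence
found = unanchored.search('I have $99.32 USD')
print(found.group())
```

Only `'$99.32'` satisfies the anchored pattern; the other samples fail because of surrounding text, a third decimal digit, or a missing dollar sign.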

For a complete list of RegEx symbols, visit:
https://docs.python.org/2/library/re.html

In [8]:
# Import the regular expression module
import re

# Compile the pattern (a raw string keeps the backslashes intact): prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')

# See if the pattern matches a well-formed phone number
result = prog.match('123-456-7890')
print(bool(result))

# See if the pattern matches when there is an extra leading digit
result2 = prog.match('1123-456-7890')
print(bool(result2))

True
False


In [9]:
"""Find additional Patterns"""
# Write the first pattern: a phone number of the form xxx-xxx-xxxx
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)

# Write the second pattern: a dollar sign, digits, and two decimal places
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)

# Write the third pattern: a capital letter followed by any number of word characters
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)


True
True
True


In [11]:
# Recode 'Male' to 1 and 'Female' to 0 in a pre-existing dataset called tips.
import numpy as np

# Define recode_sex()
def recode_sex(sex_value):

    # Return 1 if sex_value is 'Male'
    if sex_value == 'Male':
        return 1
    
    # Return 0 if sex_value is 'Female'    
    elif sex_value == 'Female':
        return 0
    
    # Return np.nan    
    else:
        return np.nan

# Apply the function to the sex column
tips['sex_recode'] = tips.sex.apply(recode_sex)
# Print the first five rows of tips
print(tips.head())


##### Drop Special Symbols

In [12]:
# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

# Write the lambda function using regular expressions (raw string pattern)
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

# Print the head of tips
print(tips.head())


##### Fill missing data with the mean

In [13]:
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality (info() prints its summary itself and returns None)
airquality.info()
