<h1> Introduction to Data Science in Python </h1>
> Hillary Green-Lerman


<h3> Course Description </h3>
Begin your journey into Data Science! Even if you've never written a line of code in your life, you'll be able to follow this course and witness the power of Python to perform Data Science. You'll use data to solve the mystery of Bayes, the kidnapped Golden Retriever, and along the way you'll become familiar with basic Python syntax and popular Data Science modules like Matplotlib (for charts and graphs) and pandas (for tabular data).

---------

<h3> [2] Loading Data in Pandas </h3>

In this chapter, you'll learn a powerful Python library: pandas. pandas lets you read, modify, and search tabular datasets (like spreadsheets and database tables). You'll examine credit card records for the suspects and see if any of them made suspicious purchases.

<b> Exercise - Loading a DataFrame </b>

We're still working hard to solve the kidnapping of Bayes, the Golden Retriever. Previously, we used a license plate spotted at the crime scene to narrow the list of suspects to:

- Fred Frequentist
- Ronald Aylmer Fisher
- Gertrude Cox
- Kirstine Smith

We've obtained credit card records for all four suspects. Perhaps some of them made suspicious purchases before the kidnapping?

The records are in a CSV called "credit_records.csv".

In [None]:
# Import pandas under the alias pd
import pandas as pd

# Load the CSV "credit_records.csv"
credit_records = pd.read_csv("credit_records.csv")

# Display the first five rows of credit_records using the .head() method
credit_records.head()
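
The file `credit_records.csv` is not included with these notes, so here is a minimal sketch of the same two calls on an in-memory CSV. The names, the `suspect` and `price` columns, and every value are invented for illustration; only `pd.read_csv` and `.head()` come from the exercise.

```python
import io

import pandas as pd

# Hypothetical stand-in for "credit_records.csv"; names, columns, and values are invented
csv_text = """suspect,location,item,price
Person A,Clothing Club,pants,14.25
Person B,Burger Mart,burger,6.50
Person C,Clothing Club,socks,4.00
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO
sample_records = pd.read_csv(io.StringIO(csv_text))

# .head(n) returns the first n rows; with no argument it returns the first five
first_two = sample_records.head(2)
print(first_two)
```

Passing a path string, as in the exercise, works the same way; the `StringIO` wrapper is only there so the sketch runs without a file on disk.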

**Exercise - Inspecting a DataFrame**

We've loaded the credit card records of our four suspects into a DataFrame called credit_records. Let's learn more about the structure of this DataFrame.

The pandas module has been imported under the alias pd. The DataFrame credit_records has already been imported.

How many rows are in credit_records?

In [None]:
# Use .info() to inspect the DataFrame credit_records
print(credit_records.info())
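
One detail worth knowing: `.info()` prints its summary itself and returns `None`, so wrapping it in `print()` adds a final `None` line to the output. The sketch below shows this on a small made-up DataFrame (the rows are assumptions) and also two other ways to get the row count, `len()` and `.shape`.

```python
import pandas as pd

# Hypothetical three-row DataFrame used only for illustration
df = pd.DataFrame({
    "item": ["pants", "burger", "socks"],
    "location": ["Clothing Club", "Burger Mart", "Clothing Club"],
})

# .info() prints the summary (row count, column names, dtypes) and returns None
result = df.info()
print(result)

# Two other ways to count rows
n_rows = len(df)                  # number of rows
shape_rows, shape_cols = df.shape  # (rows, columns)
print(n_rows, shape_rows, shape_cols)
```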

------------

**Common mistakes in column selection**

1. Use brackets and string for column names with spaces or special characters (- , ? , etc)

`police_report['Is Golden Retriever?']`

2. When using brackets and string, don't forget the quotes around the column name.

`credit_report['location']` 

**NOT**

`credit_report[location]`

Python will think the column name is a variable that hasn't been defined yet.

3. Brackets, not parentheses

If parentheses are used, Python will think that the DataFrame is being called as a function and will raise a `TypeError`.
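
To see the three cases side by side, here is a small sketch that triggers each mistake on purpose and records the error type. The DataFrame contents are invented; only the column names echo the examples above.

```python
import pandas as pd

# Hypothetical data; only the column names mirror the examples above
police_report = pd.DataFrame({
    "location": ["Park", "Mall"],
    "Is Golden Retriever?": [True, False],
})

# Correct: brackets plus a quoted string, needed here because of the spaces and "?"
golden = police_report["Is Golden Retriever?"]
print(golden.tolist())

errors = []

# Mistake: no quotes, so Python looks for a variable named location
try:
    police_report[location]
except NameError as err:
    errors.append(type(err).__name__)

# Mistake: parentheses, so Python tries to call the DataFrame like a function
try:
    police_report("location")
except TypeError as err:
    errors.append(type(err).__name__)

print(errors)
```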

**Exercise - Two methods of selecting columns**

Once again, we've loaded the credit card records of our four suspects into a DataFrame called credit_records. Let's examine the items that they've purchased.

The pandas module has been imported under the alias pd. The DataFrame credit_records has already been imported.

In [None]:
# Select the column item from credit_records
# Use brackets and string notation
items = credit_records['item']

# Display the results
print(items)

In [None]:
# Select the column item from credit_records
# Use dot notation
items = credit_records.item

# Display the results
print(items)

**Exercise - Correcting column selection errors**

A junior detective tried to access the location column of credit_records, but he made some mistakes. Help correct his code so that we can search for suspicious purchases.

In all exercises going forward, pandas will be imported as pd. The DataFrame credit_records has already been imported.

In [None]:
# One or more lines of code contain errors.
# Fix the errors so that the code runs.

# Select the location column in credit_records
location = credit_records['location']

# Select the item column in credit_records
items = credit_records.item

# Display results
print(location)

-----------------

**More column selection mistakes**

Another junior detective is examining a DataFrame of Missing Puppy Reports. He's made some mistakes that cause the code to fail.

The pandas module has been loaded under the alias pd, and the DataFrame is called mpr.

In [None]:
# Use info() to inspect mpr
print(mpr.info())

In [None]:
# Use info() to inspect mpr
print(mpr.info())

# The following code contains one or more errors
# Correct the mistakes in the code so that it runs without errors

# Select column "Dog Name" from mpr
name = mpr['Dog Name']

# Select column "Missing?" from mpr
is_missing = mpr['Missing?']

# Display the columns
print(name)
print(is_missing)

--------------

### Selecting Rows with Logic

**Logical Statements in Python**
- checks for a relationship between two values (such as "equal to" or "greater than"), and returns a Boolean (True or False)

In [None]:
question = 12 * 8
solution = 96
question == solution

**Exercise - Logical Testing**

Recall that we use the following operators:

- `==` tests that two values are equal.
- `!=` tests that two values are not equal.
- `>` and `<` test greater than or less than, respectively.
- `>=` and `<=` test greater than or equal to or less than or equal to, respectively.
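
As a quick illustration of these operators, the sketch below evaluates each one and prints the Boolean result. The three values are made up for the example. Note that a single `=` assigns a value and does not test anything.

```python
# Hypothetical values, chosen only to exercise each operator
height_inches = 65      # single = assigns; it is not a comparison
plate1 = "FRQ123"
fur_color = "golden"

checks = {
    "height_inches > 70": height_inches > 70,
    "height_inches <= 65": height_inches <= 65,
    'plate1 == "FRQ123"': plate1 == "FRQ123",
    'fur_color != "brown"': fur_color != "brown",
}

# Every comparison evaluates to either True or False
for expression, result in checks.items():
    print(expression, "->", result)
```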

**Instructions**

1. The variable height_inches represents the height of a suspect. Is height_inches greater than 70 inches?
2. The variable plate1 represents a license plate number of a suspect. Is it equal to FRQ123?
3. The variable fur_color represents the color of Bayes' fur. Check that fur_color is not equal to "brown".

In [None]:
# Is height_inches greater than 70 inches?
print(height_inches > 70)

# Is plate1 equal to "FRQ123"?
print(plate1 == "FRQ123")

# Is fur_color not equal to "brown"?
print(fur_color != "brown")

**Exercise - Selecting missing puppies**

Let's return to our DataFrame of missing puppies, which is loaded as mpr. Let's select a few different rows to learn more about the other missing dogs.

**Instructions**
- Select the dogs where Age is greater than 2.
- Select the dogs whose Status is equal to Still Missing.
- Select all dogs whose Dog Breed is not equal to Poodle.

In [None]:
# Select the dogs where Age is greater than 2
greater_than_2 = mpr[mpr.Age > 2]
print(greater_than_2)

# Select the dogs whose Status is equal to Still Missing
still_missing = mpr[mpr.Status ==  'Still Missing']
print(still_missing)

# Select all dogs whose Dog Breed is not equal to Poodle
not_poodle = mpr[mpr['Dog Breed'] != 'Poodle']
print(not_poodle)
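
It can help to look at what the expression inside the brackets produces. A comparison on a column returns one True/False value per row, and indexing the DataFrame with that result keeps only the True rows. This is a sketch on invented rows; the column names follow the exercise.

```python
import pandas as pd

# Hypothetical rows; column names follow the exercise (Age, Status, Dog Breed)
mpr_sample = pd.DataFrame({
    "Dog Name": ["Rex", "Luna", "Max"],
    "Age": [1, 3, 5],
    "Status": ["Found", "Still Missing", "Still Missing"],
    "Dog Breed": ["Poodle", "Beagle", "Poodle"],
})

# The comparison yields one Boolean per row
mask = mpr_sample.Age > 2
print(mask.tolist())

# Indexing with the Boolean column keeps only the rows where it is True
older_dogs = mpr_sample[mask]
print(older_dogs)
```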

**Exercise - Narrowing the list of suspects**

In Chapter 1, we found a list of people whose cars matched the description of the one that kidnapped Bayes:

- Fred Frequentist
- Ronald Aylmer Fisher
- Gertrude Cox
- Kirstine Smith

We'd like to narrow this list down, so we obtained credit card records for each suspect. We'd like to know if any of them recently purchased dog treats to use in the kidnapping. If they did, they would have visited 'Pet Paradise'.

The credit records have been loaded into a DataFrame called credit_records.

**Instructions**
- Select rows of credit_records such that the column location is equal to 'Pet Paradise'.

In [None]:
# Select purchases from 'Pet Paradise'
purchase = credit_records[credit_records.location == 'Pet Paradise']

# Display
print(purchase)

---

## [3] Creating Line Plots

Line Plot
- uses a coordinate grid to plot a series of points and then connects each point using a line.

**Exercise - Working Hard**

Several police officers have been working hard to help us solve the mystery of Bayes, the kidnapped Golden Retriever. Their commanding officer wants to know exactly how hard each officer has been working on this case. Officer Deshaun has created a DataFrame called `deshaun` to track the amount of time he spent working on this case. The DataFrame contains two columns:

- `day_of_week`: a string representing the day of the week
- `hours_worked`: the number of hours that a particular officer worked on the Bayes case


**Instructions**

Plot Officer Deshaun's hours worked using the columns `day_of_week` and `hours_worked` from `deshaun`.

In [None]:
# From matplotlib, import pyplot under the alias plt
from matplotlib import pyplot as plt

# Plot Officer Deshaun's hours_worked vs. day_of_week
plt.plot(deshaun.day_of_week, deshaun.hours_worked)

# Display Deshaun's plot
plt.show()

![image.png](attachment:image.png)


**Or hardly working?** 

Two other officers have been working with Deshaun to help find Bayes. Their names are Officer Mengfei and Officer Aditya. Deshaun used their time cards to create two more DataFrames: `mengfei` and `aditya`. In this exercise, we'll plot all three lines together to see who was working hard each day.

We've already loaded `matplotlib` under the alias `plt`.

**Instructions**

- Plot Officer Aditya's time worked with day_of_week on the x-axis and hours_worked on the y-axis.
- Plot Officer Mengfei's time worked with day_of_week on the x-axis and hours_worked on the y-axis.

In [None]:
# Plot Officer Deshaun's hours_worked vs. day_of_week
plt.plot(deshaun.day_of_week, deshaun.hours_worked)

# Plot Officer Aditya's hours_worked vs. day_of_week
plt.plot(aditya.day_of_week, aditya.hours_worked)

# Plot Officer Mengfei's hours_worked vs. day_of_week
plt.plot(mengfei.day_of_week, mengfei.hours_worked)

# Display all three line plots
plt.show()

![image.png](attachment:image.png)

> The orange line has no hours worked on Thursday or Friday.

---

### Adding text to plots

**Adding labels and plot title** <br>
*always before `plt.show`* <br>
`plt.xlabel` <br>
`plt.ylabel` <br>
`plt.title`

**Adding legends** <br>
1. Add keyword argument *label* to each of plt.plot
2. Add final function: `plt.legend()`


### Arbitrary Text

`plt.text`

Takes three (3) arguments:
1. x - coordinate where we want to put the text
2. y - coordinate where we want to put the text
3. text we want to display as a string
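
The pieces above can be combined in one short sketch. The officers and hours are invented, and the `Agg` backend line is an assumption added so the code runs without a display; in a notebook you would drop it and finish with `plt.show()`.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
from matplotlib import pyplot as plt

# Hypothetical hours for two made-up officers
days = ["Mon", "Tue", "Wed"]
plt.figure()
plt.plot(days, [3, 5, 4], label="Officer A")
plt.plot(days, [2, 2, 6], label="Officer B")

# Axis labels and title
plt.xlabel("Day of Week")
plt.ylabel("Hours Worked")
plt.title("Hours Worked per Day")

# The legend is built from the label= keyword of each plt.plot call
plt.legend()

# Floating text: x position, y position, then the string.
# With string x-values the positions are 0, 1, 2, so x=1 lines up with "Tue".
plt.text(1, 5.2, "Busiest day for A")

ax = plt.gca()  # current axes, kept only so the result can be inspected
# plt.show() would display the figure in a notebook
```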

**Exercise - Adding a legend**

Officers Deshaun, Mengfei, and Aditya have all been working with you to solve the kidnapping of Bayes. Their supervisor wants to know how much time each officer has spent working on the case.

Deshaun created a plot of data from the DataFrames `deshaun`, `mengfei`, and `aditya` in the previous exercise. Now he wants to add a legend to distinguish the three lines.

**Instructions**
1. Using the keyword `label`, label Deshaun's plot as "Deshaun".
2. Add labels to Mengfei's ("Mengfei") and Aditya's ("Aditya") plots.
3. Add a command to make the legend display.

In [None]:
# Officer Deshaun
plt.plot(deshaun.day_of_week, deshaun.hours_worked, label='Deshaun')

# Add a label to Aditya's plot
plt.plot(aditya.day_of_week, aditya.hours_worked, label='Aditya')

# Add a label to Mengfei's plot
plt.plot(mengfei.day_of_week, mengfei.hours_worked, label='Mengfei')

# Add a command to make the legend display
plt.legend()

# Display plot
plt.show()

![image.png](attachment:image.png)

> Mengfei's line has no hours worked on Monday and Tuesday.

**Exercise - Adding labels**

If we give a chart with no labels to Officer Deshaun's supervisor, she won't know what the lines represent.

We need to add labels to Officer Deshaun's plot of hours worked.

**Instructions**
1. Add a descriptive title to the chart.
2. Add a label for the y-axis.

In [None]:
# Lines
plt.plot(deshaun.day_of_week, deshaun.hours_worked, label='Deshaun')
plt.plot(aditya.day_of_week, aditya.hours_worked, label='Aditya')
plt.plot(mengfei.day_of_week, mengfei.hours_worked, label='Mengfei')

# Add a title
plt.title("Deshaun's Worked Hours")

# Add y-axis label
plt.ylabel("Worked Hours")

# Legend
plt.legend()
# Display plot
plt.show()

![image.png](attachment:image.png)

**Adding floating text**

Officer Deshaun is examining the number of hours that he worked over the past six months. The number for June is low because he only had data for the first week. Help Deshaun add an annotation to the graph to explain this.

**Instructions**
- Place the annotation "Missing June data" at the point (2.5, 80).

In [None]:
# Create plot
plt.plot(six_months.month, six_months.hours_worked)

# Add annotation "Missing June data" at (2.5, 80)
plt.text(2.5, 80, "Missing June data")

# Display graph
plt.show()

![image.png](attachment:image.png)

----

### Styling graphs

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)
![image-4.png](attachment:image-4.png)
![image-5.png](attachment:image-5.png)

### Exercise - Tracking crime statistics

Sergeant Laura wants to do some background research to help her better understand the cultural context for Bayes' kidnapping. She has plotted Burglary rates in three U.S. cities using data from the Uniform Crime Reporting Statistics.

She wants to present this data to her officers, and she wants the image to be as beautiful as possible to effectively tell her data story.

**Recall:**

- You can change `linestyle` to dotted (`':'`), dashed (`'--'`), or no line (`''`).
- You can change the `marker` to circle (`'o'`), diamond (`'d'`), or square (`'s'`).
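
A minimal sketch of these keyword arguments on made-up numbers follows. The years, values, and labels are assumptions, and the `Agg` backend line is only there so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
from matplotlib import pyplot as plt

# Hypothetical data, not taken from the crime statistics
years = [2000, 2005, 2010]
plt.figure()

# plt.plot returns a list of line objects; [0] takes the single line drawn
dotted = plt.plot(years, [10, 8, 7], linestyle=":", color="DarkCyan", label="dotted")[0]
dashed = plt.plot(years, [9, 9, 6], linestyle="--", marker="o", label="dashed, circles")[0]
squares = plt.plot(years, [7, 6, 5], linestyle="", marker="s", label="squares, no line")[0]

plt.legend()
# plt.show() would display the figure in a notebook
```

With `linestyle=""` and a marker, only the points are drawn, which makes a line plot look like a scatter plot.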

**Instructions**
- Change the color of Phoenix to "DarkCyan".
- Make the Los Angeles line dotted.
- Add square markers to Philadelphia.

In [None]:
# Change the color of Phoenix to `"DarkCyan"`
plt.plot(data["Year"], data["Phoenix Police Dept"], label="Phoenix", color = "DarkCyan")

# Make the Los Angeles line dotted
plt.plot(data["Year"], data["Los Angeles Police Dept"], label="Los Angeles", linestyle = ":")

# Add square markers to Philadelphia
plt.plot(data["Year"], data["Philadelphia Police Dept"], label="Philadelphia", marker = "s")

# Add a legend
plt.legend()

# Display the plot
plt.show()

![image.png](attachment:image.png)

### Playing with Styles

Help Sergeant Laura try out a few different style options. Changing the plotting style is a fast way to change the entire look of your plot without having to update individual colors or line styles. Some popular styles include:

- `'fivethirtyeight'` - Based on the color scheme of the popular website
- `'grayscale'` - Great for when you don't have a color printer!
- `'seaborn'` - Based on another Python visualization library (named `'seaborn-v0_8'` in Matplotlib 3.6 and later)
- `'classic'` - The default color scheme of Matplotlib before version 2.0
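
Style names depend on the installed Matplotlib version, so it can be worth checking before calling `plt.style.use`. The sketch below looks each name up in `plt.style.available`; the list of names to check is just an example.

```python
from matplotlib import pyplot as plt

# Names to look up; "seaborn" may be missing on newer Matplotlib versions
wanted = ["fivethirtyeight", "grayscale", "ggplot", "classic", "seaborn"]

# plt.style.available is a list of the style names this installation knows
installed = {name: name in plt.style.available for name in wanted}
for name, is_available in installed.items():
    print(name, "->", is_available)

# Apply a style only if it exists, so the script does not fail on a missing name
if installed["fivethirtyeight"]:
    plt.style.use("fivethirtyeight")
```

`plt.style.use` changes the look of every plot created afterwards in the same session.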

**Instructions:**
- Change the plotting style to "fivethirtyeight".

In [None]:
# Change the style to fivethirtyeight
plt.style.use('fivethirtyeight')

# Plot lines
plt.plot(data["Year"], data["Phoenix Police Dept"], label="Phoenix")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label="Los Angeles")
plt.plot(data["Year"], data["Philadelphia Police Dept"], label="Philadelphia")

# Add a legend
plt.legend()

# Display the plot
plt.show()

![image.png](attachment:image.png)

In [None]:
# Change the style to ggplot
plt.style.use('ggplot')

# Plot lines
plt.plot(data["Year"], data["Phoenix Police Dept"], label="Phoenix")
plt.plot(data["Year"], data["Los Angeles Police Dept"], label="Los Angeles")
plt.plot(data["Year"], data["Philadelphia Police Dept"], label="Philadelphia")

# Add a legend
plt.legend()

# Display the plot
plt.show()

![image.png](attachment:image.png)

### Identifying Bayes' kidnapper
We've narrowed the possible kidnappers down to two suspects:

- Fred Frequentist (`suspect1`)
- Gertrude Cox (`suspect2`)

The kidnapper left a long ransom note containing several unusual phrases. Help DataCamp by using a line plot to compare the frequency of letters in the ransom note to samples from the two main suspects.

Three DataFrames have been loaded:

- `ransom` contains the letter frequencies for the ransom note.
- `suspect1` contains the letter frequencies for the sample from Fred Frequentist.
- `suspect2` contains the letter frequencies for the sample from Gertrude Cox.

Each DataFrame contains two columns: `letter` and `frequency`.

**Instructions**
- Plot the letter frequencies from the ransom note. The x-values should be `ransom.letter`. The y-values should be `ransom.frequency`. The label should be the string `'Ransom'`. The line should be dotted and `gray`.

In [None]:
# x should be ransom.letter and y should be ransom.frequency
plt.plot(ransom.letter, ransom.frequency,
         # Label should be "Ransom"
         label = "Ransom",
         # Plot the ransom letter as a dotted gray line
         linestyle = ':', color = 'gray')

# Display the plot
plt.show()

![image.png](attachment:image.png)

In [None]:
# Plot each line
plt.plot(ransom.letter, ransom.frequency,
         label = 'Ransom', linestyle = ':', color = 'gray')

# X-values should be suspect1.letter
# Y-values should be suspect1.frequency
# Label should be "Fred Frequentist"
plt.plot(suspect1.letter, suspect1.frequency, label = "Fred Frequentist")

# Display the plot
plt.show()

![image.png](attachment:image.png)

In [None]:
# Plot each line
plt.plot(ransom.letter, ransom.frequency,
         label = 'Ransom', linestyle = ':', color = 'gray')
plt.plot(suspect1.letter, suspect1.frequency, label = 'Fred Frequentist')
plt.plot(suspect2.letter, suspect2.frequency, label = 'Gertrude Cox')

# Add x- and y-labels
plt.xlabel("Letter")
plt.ylabel("Frequency")

# Add a legend
plt.legend()

# Display plot
plt.show()

![image-2.png](attachment:image-2.png)

> It looks like Fred Frequentist is the kidnapper. Both the ransom note and Fred's sample have a low frequency of H and a high frequency of P.

-----

## [4] Making a scatter plot

**Exercise - Charting cellphone data**

We know that Freddy Frequentist is the one who kidnapped Bayes the Golden Retriever. Now we need to learn where he is hiding.

Our friends at the police station have acquired cell phone data, which gives some of Freddy's locations over the past three weeks. It's stored in the DataFrame `cellphone`. The x-coordinates are in the column `'x'` and the y-coordinates are in the column `'y'`.

The matplotlib module has been imported under the alias plt.

**Instructions**
- Display the first five rows of the DataFrame and determine which columns to plot.
- Create a scatter plot of the data in cellphone.

In [None]:
# Explore the data
print(cellphone.head())

# Create a scatter plot of the data from the DataFrame cellphone
plt.scatter(cellphone.x, cellphone.y)

# Add labels
plt.ylabel('Latitude')
plt.xlabel('Longitude')

# Display the plot
plt.show()

![image.png](attachment:image.png)

**Exercise - Modifying a scatterplot**

In the previous exercise, we created a scatter plot to show Freddy Frequentist's cell phone data.

In this exercise, we've done some magic so that the plot will appear over a map of our town. If we just plot the data as we did before, we won't be able to see the map or pick out the areas with the most points. We can fix this by changing the colors, markers, and transparency of the scatter plot.

As before, the matplotlib.pyplot module has been imported under the alias plt, and the cellphone data is in the DataFrame cellphone.

**Instructions**
- Change the color of the points to 'red'.
- Change the marker shape to square.
- Change the transparency of the scatterplot to 0.1.

In [None]:
# Change the color to red, the marker to square, and the transparency to 0.1
plt.scatter(cellphone.x, cellphone.y,
           color='red',
           marker='s',
           alpha = 0.1)

# Add labels
plt.ylabel('Latitude')
plt.xlabel('Longitude')

# Display the plot
plt.show()
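
To see why a low `alpha` helps pick out dense areas, here is a small sketch with invented coordinates. Three points share one location, so their transparent squares stack and appear darker than the lone point. The `Agg` backend line is an assumption so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
from matplotlib import pyplot as plt

# Hypothetical coordinates: three points overlap at (1, 1), one sits alone at (2, 3)
x = [1, 1, 1, 2]
y = [1, 1, 1, 3]

plt.figure()
# alpha ranges from 0 (invisible) to 1 (opaque); overlapping points add up visually
points = plt.scatter(x, y, color="red", marker="s", alpha=0.1)

plt.xlabel("Longitude")
plt.ylabel("Latitude")
# plt.show() would display the figure in a notebook
```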

![image-2.png](attachment:image-2.png)

> Freddy has been spending a lot of time in Blue Meadows Park, Happy Mountain Trailhead, and Shady Groves Campsite.

### Build a simple bar chart

**Exercise - Build a simple bar chart**

Officer Deshaun wants to plot the average number of hours worked per week for him and his coworkers. He has stored the hours worked in a DataFrame called `hours`, which has columns `officer` and `avg_hours_worked`. Recall that the function plt.bar() takes two arguments: the labels for each bar, and the height of each bar. Both of these can be found in our DataFrame.

**Instructions**
- Create a bar chart of the column avg_hours_worked for each officer from the DataFrame hours.
- Use the column std_hours_worked (the standard deviation of the hours worked) to add error bars to the bar chart.

In [None]:
# Display the DataFrame hours using print
print(hours)

# Create a bar plot from the DataFrame hours
plt.bar(hours.officer, hours.avg_hours_worked,
        # Add error bars
        yerr = hours.std_hours_worked)

# Display the plot
plt.show()
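
Since the `hours` DataFrame is not included here, the following sketch repeats the same call on made-up lists. The officer labels and numbers are assumptions, as is the `Agg` backend line that lets the code run without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
from matplotlib import pyplot as plt

# Hypothetical averages and standard deviations
officers = ["A", "B", "C"]
avg_hours = [30, 42, 38]
std_hours = [3, 5, 2]

plt.figure()
# First argument: bar labels; second: bar heights; yerr: half-length of each error bar
bars = plt.bar(officers, avg_hours, yerr=std_hours)

# The returned container holds one rectangle per bar
heights = [bar.get_height() for bar in bars]
print(heights)
# plt.show() would display the figure in a notebook
```

Each error bar extends `yerr` above and below the top of its bar, so officer B's bar would span 37 to 47 in this example.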

![image-3.png](attachment:image-3.png)

**Exercise - Where did the time go?**

Officer Deshaun wants to compare the hours spent on field work and desk work between him and his colleagues. In this DataFrame, he has split out the average hours worked per week into `desk_work` and `field_work`.

You can use the same DataFrame containing the hours worked from the previous exercise (`hours`).

**Instructions**
- Create a bar plot of the time each officer spends on desk_work.
- Label that bar plot "Desk Work".

In [None]:
# Plot the number of hours spent on desk work
plt.bar(hours.officer, hours.desk_work, label = "Desk Work")

# Display the plot
plt.show()

![image.png](attachment:image.png)

**Instructions**

- Create a bar plot for field_work whose bottom is the height of desk_work.
- Label the field_work bars as "Field Work" and add a legend.

In [None]:
# Plot the number of hours spent on desk work
plt.bar(hours.officer, hours.desk_work, label = 'Desk Work')

# Plot the hours spent on field work on top of desk work
plt.bar(hours.officer, hours.field_work,
        bottom  = hours.desk_work, label = 'Field Work')

# Add a legend
plt.legend()

# Display the plot
plt.show()
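
The `bottom` keyword is what makes the bars stack: each field-work bar starts where the matching desk-work bar ends. A sketch with invented numbers makes this visible by reading the bar positions back (the `Agg` backend line is an assumption so the code runs without a display).

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
from matplotlib import pyplot as plt

# Hypothetical hours per officer
officers = ["A", "B", "C"]
desk_work = [10, 20, 15]
field_work = [25, 15, 20]

plt.figure()
plt.bar(officers, desk_work, label="Desk Work")
# bottom= shifts each field-work bar up by the height of the desk-work bar
top_bars = plt.bar(officers, field_work, bottom=desk_work, label="Field Work")
plt.legend()

# Where each upper bar starts, and where it ends (the combined total)
top_starts = [bar.get_y() for bar in top_bars]
totals = [bar.get_y() + bar.get_height() for bar in top_bars]
print(top_starts, totals)
# plt.show() would display the figure in a notebook
```

Without `bottom`, the second call would draw its bars from zero and cover the first set.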

![image.png](attachment:image.png)

----

## Making a Histogram

**Histogram**
- visualizes the distribution of values in a dataset.

**Creating a Histogram**
- place each piece of data into a bin

## Normalizing
- scales the height of each bar by a constant factor so that the areas of the bars sum to one
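
A small worked example may make the "areas, not heights" point concrete. It uses `numpy.histogram`, which computes the same bin counts that a histogram plot draws; the eight values are invented.

```python
import numpy as np

# Hypothetical measurements
values = np.array([1, 1, 3, 5, 5, 5, 7, 7])

# Raw counts in 4 bins covering 0 to 8 (each bin is 2 wide)
counts, edges = np.histogram(values, bins=4, range=(0, 8))

# Normalized heights: count / (number of values * bin width)
density, _ = np.histogram(values, bins=4, range=(0, 8), density=True)

bin_widths = np.diff(edges)
total_area = float(np.sum(density * bin_widths))

print(counts.tolist())    # [2, 1, 3, 2]
print(density.tolist())   # [0.125, 0.0625, 0.1875, 0.125]
print(total_area)         # 1.0
```

The normalized heights add up to 0.5, not 1. Multiplying each height by its bin width of 2 gives the areas, and those add up to 1.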

![image.png](attachment:image.png)

**Exercise - Modifying histograms**

Let's explore how changes to keyword parameters in a histogram can change the output. Recall that:

- `range` sets the minimum and maximum datapoints that we will include in our histogram.
- `bins` sets the number of bins (bars) in our histogram.
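
`plt.hist` returns the bin counts and bin edges it computed, which gives a way to check what `bins` and `range` did. The weights below are made up, and the `Agg` backend line is an assumption so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
from matplotlib import pyplot as plt

# Hypothetical puppy weights; 4 and 40 fall outside the chosen range
weights = [4, 6, 10, 12, 20, 21, 22, 30, 34, 40]

plt.figure()
# 5 bins between 5 and 35, so each bin is 6 wide
counts, edges, patches = plt.hist(weights, bins=5, range=(5, 35))

print(counts.tolist())   # [2.0, 1.0, 3.0, 0.0, 2.0]
print(edges.tolist())    # [5.0, 11.0, 17.0, 23.0, 29.0, 35.0]
# plt.show() would display the figure in a notebook
```

Five bins need six edges, and the counts add up to 8 because the two values outside `range` are left out.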

We'll be exploring the weights of various puppies from the DataFrame `puppies`. matplotlib has been loaded under the alias plt.

**Instructions**
- Create a histogram of the column `weight` from the DataFrame `puppies`.

In [None]:
# Create a histogram of the column weight from the DataFrame puppies
plt.hist(puppies.weight)

# Add labels
plt.xlabel('Puppy Weight (lbs)')
plt.ylabel('Number of Puppies')

# Display
plt.show()

![image.png](attachment:image.png)

**Instructions**

Change the number of bins to 50.

In [None]:
# Change the number of bins to 50
plt.hist(puppies.weight,
        bins = 50)

# Add labels
plt.xlabel('Puppy Weight (lbs)')
plt.ylabel('Number of Puppies')

# Display
plt.show()

![image.png](attachment:image.png)

**Instructions**

Change the range to start at 5 and end at 35.

In [None]:
# Change the range to start at 5 and end at 35
plt.hist(puppies.weight,
        range = (5, 35))

# Add labels
plt.xlabel('Puppy Weight (lbs)')
plt.ylabel('Number of Puppies')

# Display
plt.show()

![image.png](attachment:image.png)

> Increasing the number of bins made your plot spikier. Changing the range restricted the portion of the dataset that was plotted. Note that the parentheses around the minimum and maximum for range were required to make the code run.

**Exercise - Heroes with histograms**

We've identified that the kidnapper is Fred Frequentist. Now we need to know where Fred is hiding Bayes.

A shoe print at the crime scene contains a specific type of gravel. Based on the distribution of gravel radii, we can determine where the kidnapper recently visited. It might be:

![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)

The radii of individual gravel pieces have been loaded into the DataFrame `gravel`, and matplotlib has been loaded under the alias plt.

**Instructions**
- Create a histogram of `gravel.radius`.
- Modify the histogram such that the histogram is divided into 40 bins and the range is from 2 to 8.
- Normalize your histogram so that the total area of the bars adds to 1.
- Label the x-axis (`Gravel Radius (mm)`), the y-axis (`Frequency`), and add the title (`Sample from Shoeprint`).

In [None]:
# Create a histogram
plt.hist(gravel.radius,
         bins=40,
         range=(2, 8),
         density=True)

# Label plot
plt.xlabel('Gravel Radius (mm)')
plt.ylabel('Frequency')
plt.title('Sample from Shoeprint')

# Display histogram
plt.show()

![image.png](attachment:image.png)