# Section 1: Importing and Summarizing Data

You can type Python code in the code cells following the instruction cells to solve the exercises. 

You can also use the IPython Shell interactively. Go to View (on the menu bar) and click Terminal, or press `Ctrl` + `` ` ``, to open the terminal. Once the terminal is open, type `py` and press Enter to launch an interactive Python session. You can then type Python statements and press Enter to see the output on the next line.

## Exercise 1: Read and explore your data

In this lab, you'll explore a dataset containing information on a university's recent graduates for each department, stored in the file `recent_grads.csv`. In this exercise, you'll read in this data using Python's `pandas` module.
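
As a rough illustration of what reading a CSV and checking its shape looks like, the sketch below parses a small made-up CSV string instead of the lab file. The column names and values are invented for the example and are not taken from `recent_grads.csv`.

```python
import io

import pandas as pd

# Made-up CSV content, used only so the sketch runs without any file on disk.
csv_text = "major,total\nBiology,120\nHistory,80\n"

# read_csv accepts a path or any file-like object; StringIO wraps the string.
toy = pd.read_csv(io.StringIO(csv_text))

# .shape is a (rows, columns) tuple.
print(toy.shape)
```
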
### Instructions

- Import `pandas` as `pd`.
- Read in the data from `recent_grads.csv` (a CSV file) and assign it to a variable called `recent_grads`.
- Print the shape of `recent_grads`.


In [None]:
# Import pandas 


# Use pandas to read in recent_grads.csv
recent_grads = ____('recent_grads.csv')

# Print the shape
print(____)

## Exercise 2: Exploring Your Data

Now you'll perform some data exploration using the Python `pandas` module. To get a sense of the data, you'll output statistics such as mean, median, count, and percentiles.

The DataFrame `recent_grads` is still in your workspace.
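
The following sketch shows, on a tiny made-up DataFrame, how column types and summary statistics can be inspected. The column names here are hypothetical and unrelated to `recent_grads`.

```python
import pandas as pd

# Made-up data with one text column and two numeric columns.
toy = pd.DataFrame({
    "name": ["x", "y", "z"],
    "n": [1, 2, 3],
    "rate": [0.1, 0.2, 0.6],
})

# One dtype per column.
print(toy.dtypes)

# For a mixed DataFrame, describe() summarizes only the numeric columns by default.
numeric_summary = toy.describe()
print(numeric_summary)

# include="all" adds the non-numeric columns to the summary as well.
full_summary = toy.describe(include="all")
print(full_summary)
```

The `include` and `exclude` arguments of `.describe()` both accept dtypes, so the same call can be widened or narrowed to the column types you care about.
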
### Instructions

- Print the `.dtypes` of your data so that you know what each column contains.
- Output basic summary statistics using a single `pandas` function.
- Using the same function, output summary statistics for all columns that aren't of type `object`.



In [None]:
# Print .dtypes
print(____)

# Output summary statistics
print(____)

# Exclude data of type object
print(____)

## Exercise 3: Replacing Missing Values

There are some missing values in the dataset that are coded as a string. You'll update these to a value that Python understands as "missing."

The list `columns` contains the names of the columns you'll be working with in this exercise.
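
To make the idea concrete, here is a minimal sketch on made-up data. The placeholder string `"UN"` is a hypothetical choice for this example; the string actually used in `recent_grads` is something you will discover in the exercise.

```python
import numpy as np
import pandas as pd

# Made-up data: "UN" is a hypothetical placeholder for a missing value.
toy = pd.DataFrame({
    "median": ["36000", "UN", "41000"],
    "p25th": ["25000", "30000", "UN"],
})

# A column that mixes numbers with a placeholder string is stored as text.
print(toy.dtypes)
print(toy["median"].unique())

# Overwrite the placeholder with NaN, one column at a time.
for column in ["median", "p25th"]:
    toy.loc[toy[column] == "UN", column] = np.nan

# Count the missing values that pandas now recognizes in each column.
print(toy.isna().sum())
```
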
### Instructions

- Look at the `dtypes` of the columns in `columns` to make sure that the data is numeric.
- It looks like a string is being used to encode missing values. Use the `.unique()` method to figure out what the string is.
- Search for missing values in the `median`, `p25th`, and `p75th` columns.
- Replace the found missing values with a `NaN` value, using `numpy`'s `np.nan`.

In [None]:
import numpy as np
# Names of the columns we're searching for missing values 
columns = ['median', 'p25th', 'p75th']

# Take a look at the dtypes
print(____)

# Find how missing values are represented
print(recent_grads["median"].____)

# Replace missing values with NaN
for column in ____:
    recent_grads.loc[____ == '____', column] = ____

## Exercise 4: Select a Column

Python's `pandas` module allows you to select a specific column from a DataFrame, which is especially useful when you only need to manipulate one piece of data. In this exercise, you'll select the `sharewomen` column, which shows the percentage of women for a given department.

The DataFrame `recent_grads` is still in your workspace.
### Instructions

- Select the `sharewomen` column from `recent_grads` and assign this to a variable named `sw_col`.
- Output the first 5 rows of `sw_col`.


In [None]:
# Select sharewomen column
sw_col = ____

# Output first five rows
print(____)

## Exercise 5: Column Maximum Value

Now that you've selected the `sharewomen` column, you'll use `numpy` to output its maximum value.

The variable `sw_col` you created in the last exercise is still available in your workspace.
### Instructions

- Import `numpy` as `np`.
- Using a `numpy` built-in function, find the maximum value of the `sharewomen` column and assign this value to the variable `max_sw`.
- Print the value of `max_sw`.


In [None]:
# Import numpy


# Use max to output maximum values
max_sw = ____

# Print column max
print(____)

## Exercise: Selecting a Row

While you know what the maximum percentage of women in a department is, which department is this? You'll output this information by filtering the dataset with `pandas`.

The variables `sw_col` and `max_sw` are still in your workspace.
### Instructions

- Output the row of data for the department that has the largest percentage of women.


In [None]:
# Output the row containing the maximum percentage of women
print(____)

## Exercise 6: Converting a DataFrame to Numpy Array

Since `numpy` is such a powerful Python module, in this exercise you'll convert a `pandas` DataFrame to a `numpy` array so that you can apply one of `numpy`'s statistical functions to it in the next exercise.
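
As one possible illustration, the sketch below converts two columns of a made-up DataFrame to a NumPy array with `.to_numpy()`. The column names `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})

# Double brackets select several columns and return a DataFrame.
subset = toy[["a", "b"]]

# .to_numpy() returns the values as a NumPy array (rows x columns).
arr = subset.to_numpy()

print(type(arr))
print(arr.shape)
```
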
### Instructions

- Select the columns `unemployed` and `low_wage_jobs` from `recent_grads`, then convert them to a `numpy` array. Save this as `recent_grads_np`.
- Print the type of `recent_grads_np` to see that it is a `numpy` array.


In [None]:
# Convert to numpy array
recent_grads_np = ____


# Print the type of recent_grads_np
print(____)

## Exercise 7: Correlation Coefficient

You have some suspicion that there's a relationship between the `low_wage_jobs` and `unemployment_rate` columns, so you decide to use `numpy` to calculate the correlation coefficient.
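
One detail worth knowing is that `np.corrcoef` treats each row as a variable by default. The sketch below uses made-up numbers laid out like a converted DataFrame (observations in rows, variables in columns) and passes `rowvar=False` to account for that layout.

```python
import numpy as np

# Made-up data: 4 observations (rows) of 2 variables (columns).
data = np.array([
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 5.9],
    [4.0, 8.2],
])

# rowvar=False tells corrcoef that each column is a variable.
corr = np.corrcoef(data, rowvar=False)

# The result is a symmetric 2 x 2 matrix with ones on the diagonal;
# the off-diagonal entry is the correlation between the two columns.
print(corr.shape)
print(corr[0, 1])
```

Passing the two columns as separate arguments, for example `np.corrcoef(data[:, 0], data[:, 1])`, is an equivalent alternative.
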
### Instructions

- Calculate the correlation matrix of the `numpy` array `recent_grads_np`.


In [None]:
# Calculate correlation matrix
print(np.corrcoef(____))

# Section 2: Manipulating Data

## Exercise 1: Creating Columns I

If you look at the dataset, you'll notice that while there's a column which shows the percentage of women in each department, there is no column which shows the percentage of men.
### Instructions

- Create a new column named `sharemen` that contains the percentage of men for a given department by dividing the number of men by the total number of students for each department.


In [None]:
# Add sharemen column
recent_grads['sharemen'] = ____

## Exercise 2: Select Row with Highest Value

Remember how you found the row of data with the highest percentage of women? Now you'll find the row that corresponds to the department with the highest rate of men.

The module `numpy` has been imported under the alias `np` for you.
### Instructions

- Using `numpy`, find the maximum value for the percentage of men and call this variable `max_men`.
- Select the row whose percentage of men equals `max_men`.


In [None]:
import numpy as np
# Find the maximum percentage value of men
max_men = ____
 
# Output the row with the highest percentage of men
print(____)

## Exercise 3: Creating Columns II

Eventually you want to figure out which departments are most balanced between men and women. To accomplish this, you'll add a new column that reports the difference in percentages between men and women.
### Instructions

- Add a column named `gender_diff` that reports how much higher the rate of women is than the rate of men.


In [None]:
# Add gender_diff column
recent_grads['gender_diff'] = ____

## Exercise 4: Updating Columns

Your data for the `gender_diff` column currently consists of negative and positive values, depending on which group (women or men) has the higher percentage. You want to find the five departments with the most balanced gender ratios, but first you decide to make your life easier by replacing the values in the `gender_diff` column with their absolute values.
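
A minimal sketch of the idea on made-up data: take absolute values with `np.abs`, then sort to find the smallest ones. The column names `name` and `diff` are hypothetical, and sorting with `.sort_values()` is just one of several ways to pick the smallest rows.

```python
import numpy as np
import pandas as pd

# Made-up signed differences, for illustration only.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "diff": [-0.40, 0.10, -0.05, 0.30],
})

# np.abs works element-wise on a pandas Series.
toy["diff"] = np.abs(toy["diff"])

# Sort ascending and keep the two rows closest to zero.
closest = toy.sort_values("diff").head(2)
print(closest)
```
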
### Instructions

- Use `numpy` and `pandas` to convert each value in the `gender_diff` column to its absolute value.
- Output the five departments with the most balanced gender ratios.


In [None]:
# Make all gender difference values positive
recent_grads['gender_diff'] = ____

# Find the 5 rows with lowest gender rate difference
print(____)

## Exercise 5: Filtering Rows

Finally, you can filter for departments whose gender difference exceeds the benchmark of 0.30. Since all the values are now positive, you can do this with a simple comparison operator.

You want to find the rows containing departments that are skewed heavily towards men. Using work you've already done, you'll create a new DataFrame that contains this information.

The DataFrame `recent_grads` still has the columns `sharemen` and `gender_diff` that you created in previous exercises.
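
The sketch below shows the general pattern on made-up data: build two boolean Series, combine them with `np.logical_and`, and use the result to select rows. All column names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({
    "score": [0.10, 0.45, 0.80, 0.35],
    "group_a": [5, 9, 2, 8],
    "group_b": [7, 3, 6, 1],
})

# Each comparison yields a boolean Series with one value per row.
high_score = toy["score"] > 0.30
a_over_b = toy["group_a"] > toy["group_b"]

# True only where both conditions hold.
both = np.logical_and(high_score, a_over_b)

# Indexing with a boolean Series keeps the rows marked True.
selected = toy[both]
print(selected)
```
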
### Instructions

- Create `diff_30`, a boolean `Series` that is `True` when the corresponding value of `gender_diff` is greater than `0.30` and `False` otherwise.
- Make another boolean `Series` called `more_men` that's `True` when the corresponding row in `recent_grads` has more men than women.
- Combine your two `Series` to make one that you can use to select rows that have both more men than women and a value of `gender_diff` greater than 0.30. Save the result as `more_men_and_diff_30`.
- Use this new boolean `Series` to create the DataFrame `fewer_women` that contains only the rows you're interested in.

In [None]:
# Rows where gender rate difference is greater than .30 
diff_30 = ____ > .30

# Rows with more men
more_men = ____ > ____

# Combine more_men and diff_30
more_men_and_diff_30 = np.logical_and(____)

# Find rows with more men and gender rate difference greater than .30
fewer_women = ____

## Exercise 6: Grouping with Counts

There are various department categories but no sense of how many departments there are in each category. You'll use `pandas` to gain insight into this information.

In particular, you'll use the `.groupby()` method of `pandas`. This was not introduced to you in the course, but you'll see it very frequently in your data science journey and it's an important method to understand. This set of exercises will extend your `pandas` knowledge by teaching you how to use the `.groupby()` method.

Calls to `.groupby()` have the following three components: the column you want to group by, the column you want to aggregate, and the statistic you want to aggregate by. For example, in our dataset, if we wanted to see the percentage of women (`'sharewomen'`) per `'major_category'`, we could use `.groupby()` like so: `recent_grads.groupby('major_category')['sharewomen'].mean()`. Here, we are grouping by `'major_category'` and aggregating `'sharewomen'` by the mean. Give it a try in the IPython Shell if you're curious to see the result!
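
To see the three components in isolation, here is a small sketch on made-up data. The column names `category` and `value` are hypothetical.

```python
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Group by 'category', aggregate 'value', using the mean as the statistic.
means = toy.groupby("category")["value"].mean()
print(means)

# Same grouping and column, but counting rows per group instead.
counts = toy.groupby("category")["value"].count()
print(counts)
```
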
### Instructions


- Using the `.groupby()` method, group the `recent_grads` DataFrame by `'major_category'` and then count the number of departments per category using `.count()`.



In [None]:
# Group by major category and count
print(recent_grads.____(['____']).major_category.____)

## Exercise 7: Grouping with Counts, Part 2

You want to get a sense of the number of majors with far fewer women than men, so you'll perform an operation similar to the one from the last exercise.

Use the `fewer_women` DataFrame from a previous exercise.
### Instructions

- Create a DataFrame that groups the departments by major category and shows the number of departments in each category that are heavily skewed toward men.



In [None]:
# Group departments that have fewer women by category and count
print(____)

## Exercise 8: Grouping One Column with Means

Similar to the exercise you just completed, you can group rows to output the means of different groups within a column.

The column `gender_diff` is still available in the `recent_grads` DataFrame.
### Instructions

- Write code that outputs the average gender percentage difference by major category.


In [None]:
# Report average gender difference by major category
print(____)

## Exercise 9: Grouping Two Columns with Means

You can expand the previous operation to include two columns and output the means for each. To accomplish this, modify the code you just wrote.
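
When aggregating more than one column, recent versions of `pandas` expect the column names as a list inside the selection brackets, which results in double brackets. A minimal sketch on made-up data (the column names are hypothetical):

```python
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "x": [1.0, 3.0, 10.0, 20.0],
    "y": [0.2, 0.4, 0.1, 0.3],
})

# The inner list selects two columns; the result is a DataFrame
# with one row per group and one column per aggregated column.
stats = toy.groupby("category")[["x", "y"]].mean()
print(stats)
```
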
### Instructions

- Write a query that outputs the mean number of `'low_wage_jobs'` and average `'unemployment_rate'` grouped by `'major_category'`.


In [None]:
# Find average number of low wage jobs and unemployment rate of each major category
dept_stats = ____.____(['____'])[['____', '____']].____
print(dept_stats)

# Section 3: Visualizing Data

## Exercise 1: Plotting Scatterplots

Now that you've calculated the correlation coefficient between the `low_wage_jobs` and `unemployment_rate` columns, you want to create a visualization to effectively display this relationship. You'll use `matplotlib` to create a scatterplot of these two columns.

The DataFrame `dept_stats` is available in your workspace again, and the columns `low_wage_jobs` and `unemployment_rate` have been extracted into variables of the same name.
### Instructions

- Import `matplotlib.pyplot` with the alias `plt`.
- Create a scatter plot between `unemployment_rate` and `low_wage_jobs` per major category.
- Label the x axis with `'Unemployment rate'`.
- Label the y axis with `'Low pay jobs'`.



In [None]:
low_wage_jobs = dept_stats['low_wage_jobs']
unemployment_rate = dept_stats['unemployment_rate']
# Import matplotlib


# Create scatter plot
plt.____

# Label x axis
plt.____

# Label y axis
plt.____

# Display the graph 
plt.show()

## Exercise 2: Modifying Plot Colors

The default settings for `matplotlib` may not be what you hope to present to others, so you decide to customize your plot for low wages versus unemployment rate.

Use the `pandas` DataFrame `dept_stats` again.
### Instructions

- Create the scatterplot visualization between the unemployment rate and number of low wage jobs per major category using the `.plot()` method.
- Customize this scatterplot so that the points are red triangles by setting the `color` argument to `"r"` and the `marker` argument to `"^"`.
- Display the plot you've created!



In [None]:
# Plot the scatter plot with red, triangle-shaped markers
plt.____

# Display the visualization


## Exercise 3: Plotting Histograms

Now that you've taken a look at that scatterplot, you want to go back to the `sharewomen` column that you were working with earlier. Specifically, you want to get an idea of how the values of `sharewomen` are distributed. This means you want to plot a histogram. For your convenience, the `sharewomen` column has been extracted from the `recent_grads` DataFrame into a variable called `sharewomen`.

### Instructions

- Use `matplotlib` to create a histogram of `sharewomen`.
- Show the plot you created.


In [None]:
sharewomen = recent_grads['sharewomen']
# Plot a histogram of sharewomen


# Show the plot


## Exercise 4: Plotting with pandas

In Python, there are several different ways to create visualizations. In fact, `pandas` has its own visualization capabilities, all of which are built on top of `matplotlib`! For example, you could have created the histogram from the previous exercise using `recent_grads.sharewomen.plot(kind="hist")` instead of `plt.hist(recent_grads.sharewomen)`.

Which approach you use comes down to personal preference, although when working with DataFrames it can be advantageous to use `pandas`' plotting capabilities because the code tends to be less verbose.

Here, you will practice creating the plots from the previous exercises using `pandas` instead of `matplotlib`. All `pandas` plots are created using the `.plot()` method on a DataFrame. Inside `.plot()`, you can specify which plot you want to create using the `kind` parameter. For example, `kind='hist'` would create a histogram, `kind='scatter'` would create a scatter plot, and so on.
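
As a rough sketch of how the `kind` parameter behaves, the example below draws both plot types from a made-up DataFrame. The column names are hypothetical, and passing explicit axes through the `ax` argument is a choice made here so that each plot lands in its own panel.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers, for illustration only.
toy = pd.DataFrame({
    "x_val": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y_val": [2.0, 1.0, 4.0, 3.0, 5.0],
})

# Two side-by-side axes so each plot gets its own panel.
fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# kind="scatter" needs the names of the x and y columns.
toy.plot(kind="scatter", x="x_val", y="y_val", ax=ax_scatter)

# kind="hist" on a single column (a Series) bins its values.
toy["y_val"].plot(kind="hist", ax=ax_hist)

fig.tight_layout()
```

Calling `plt.show()` afterwards displays the figure, just as in the exercise cells.
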
### Instructions
- Use the `.plot()` method with `kind='scatter'` on the `dept_stats` DataFrame to create a scatter plot with `'unemployment_rate'` on the x-axis and `'low_wage_jobs'` on the y-axis.

- Now, create a histogram of the `sharewomen` column of the `recent_grads` DataFrame. 

In [None]:
# Import matplotlib and pandas
import matplotlib.pyplot as plt
import pandas as pd

# Create scatter plot
dept_stats.____(kind='____', x='____', y='____')
plt.show()

# Create histogram
recent_grads.sharewomen.plot(kind='____')
plt.show()

## Congratulations

You did it! You have successfully completed the exercises. 

Save your work by pressing `Ctrl + s` and head over to the next video in the course.

Don't forget to turn off the lab. Remember: don't shut down the machine; instead, click the x icon at the top right to close the lab session.