# Project Name

In this project you will look for insights in a dataset by creating aggregated bar charts with `sns.barplot()` and distribution plots with Seaborn's `sns.boxplot()` and `sns.violinplot()`.

The dataset is from the <a href = "http://apps.who.int/gho/data/node.main.WSHWATER?lang=en" target = "_blank">World Health Organization</a> and contains information on basic and safely managed drinking water services by country.

Some of the steps below will have hidden hints that you can access if you need them. Hints will look like this:

Hint:<font color=white> Great job, you didn't even need a hint to find the hint!</font>

The actual hint follows the word 'Hint:' and is in white text. To reveal the hint drag your cursor across the text to highlight and reveal it. Try highlighting the entire hint above to see how they work.

**A Note On `plt.show()`:** You may be used to displaying your plots with `plt.show()`. This IPython Jupyter notebook uses an inline backend, so you do not need to call `plt.show()` after each plot. You should be able to render your Seaborn plots simply by running the cell that contains the code for your plot.

If you have issues rendering your plot, you can either add `plt.show()` to the cell or manually invoke the inline backend by adding the following line of code to the cell where you import Python modules.

`%matplotlib inline`

## Step 1 Import Python Modules
Import the modules that you'll be using in this project:
- `from matplotlib import pyplot as plt`
- `import pandas as pd`
- `import seaborn as sns`

## Step 2 Ingest The Data
Load **clean_water.csv** into a DataFrame called `df`. Then, quickly inspect the DataFrame using `.head()`.

Hint:<font color=white> Use pd.read_csv()</font>
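
As a rough illustration of loading and inspecting a CSV, the sketch below reads a small in-memory stand-in instead of the real file. The rows are invented for illustration and are not WHO figures. Only the column names `country`, `area`, and `value` come from this project; the `year` column name and the `Rural`/`Urban` spellings are assumptions.

```python
import io

import pandas as pd

# Invented stand-in for clean_water.csv (NOT real WHO figures).
# `year`, "Rural", and "Urban" are assumed names for illustration.
csv_text = """country,year,area,value
Mexico,2000,Rural,70.0
Mexico,2000,Urban,95.0
Mexico,2000,Total,88.0
Zimbabwe,2000,Rural,60.0
Zimbabwe,2000,Urban,97.0
Zimbabwe,2000,Total,72.0
"""

# pd.read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())   # first five rows by default
print(sample.head(3))  # pass a number to control how many rows are shown
```

With the real file you would pass the file name to `pd.read_csv()` instead of the `StringIO` object.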

## Step 3 Examine The Data

It may be easier to understand the dataset if you look at the raw CSV file on your local machine. You can find **clean_water.csv** in the project download folder.

### Overview of Dataset:

The data values themselves represent the percent **(%)** of a country's population using at least basic drinking-water services.

For more information on how the metric is determined, see the <a href = "http://apps.who.int/gho/data/node.wrapper.imr?x-id=4818" target = "_blank"> "Indicator Metadata Registry".</a>

Print the first 50 rows of `df` using `.head()`

### Notice:
For each of the five countries represented there is data for the years 2000-2015, with each year having three values: rural, urban, and the total percentage of people using at least basic drinking-water services.

## Step 4 Barplots

Make a bar plot with Seaborn to see how each of the five countries stacks up in rural, urban, and total access to clean drinking water.

We have set up a figure for you to plot on. Use `sns.barplot()` with the following arguments:

- `x` set to `country`
- `y` set to `value`
- `hue` set to `area`
- `data` set to `df`

In [2]:
f, ax = plt.subplots(figsize=(25, 15))
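
To make the word "aggregated" concrete, here is a minimal sketch on an invented toy DataFrame (the countries `A` and `B` and all numbers are hypothetical). By default `sns.barplot()` draws one bar per group at the mean of `y`, so the same numbers can be reproduced by hand with a pandas `groupby`.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Invented numbers for illustration only (not WHO figures)
toy = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "area": ["Rural", "Rural", "Urban", "Urban"] * 2,
    "value": [50.0, 60.0, 90.0, 94.0, 70.0, 74.0, 96.0, 98.0],
})

f, ax = plt.subplots(figsize=(8, 5))
# Each bar is the mean of `value` within one country/area group
sns.barplot(x="country", y="value", hue="area", data=toy)

# The same aggregation done by hand with pandas
group_means = toy.groupby(["country", "area"])["value"].mean()
print(group_means)
```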


### Styling With Color Palette

We are interested in comparing and looking for patterns amongst the different areas within distinct countries so a qualitative color palette will probably be most effective in visualizing this data for our purposes. You can read more about <a href = "https://seaborn.pydata.org/tutorial/color_palettes.html#qualitative-color-palettes" target = "_blank">qualitative color palettes in the Seaborn documentation.</a>

### Adding Meaningful Names To Axes

If you look at the y-axis of the bar plot above, you will see that the label is the same as the column name in our data: `value`. While this is a good name for coding with the DataFrame, it is not descriptive enough to be a helpful label in the visualization. Axis labels can be manually overridden using `ax.set()`, which accepts an `xlabel` and/or a `ylabel` argument.

A) Set a different color palette using `sns.set_palette()`. You can use any of the Color Brewer qualitative color palettes:

- Set1
- Set2
- Set3
- Pastel1
- Pastel2
- Dark2
- Accent


B) Plot a barplot again, but this time assign the result of `sns.barplot()` to `ax` so you can access the axes properties using `ax.set()` on the next line of code.

C) Rename the y-axis label to something more descriptive than `value`. For example: `(%) of population using clean water`.
 
Hint:  <font color=white>Use ax.set(ylabel="descriptive label") </font>

In [3]:
# set color palette

# create figure and axes
f, ax = plt.subplots(figsize=(25, 15))

# barplot goes here

# set new ylabel here
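
The three steps above might fit together roughly as follows. This is a sketch on an invented toy DataFrame (hypothetical countries `A` and `B`, made-up values), not the project solution, and the x-axis label is just an example.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Invented numbers for illustration only (not WHO figures)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "area": ["Rural", "Urban", "Rural", "Urban"],
    "value": [55.0, 92.0, 72.0, 97.0],
})

# A) choose a qualitative palette for all following plots
sns.set_palette("Set2")

f, ax = plt.subplots(figsize=(8, 5))

# B) sns.barplot() returns the matplotlib Axes it drew on
ax = sns.barplot(x="country", y="value", hue="area", data=toy)

# C) override the default axis labels taken from the column names
ax.set(xlabel="Country", ylabel="(%) of population using clean water")
```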


## Step 5 Boxplots

According to the Seaborn documentation:
> "A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. "


Make a boxplot to compare the distribution of rural, urban, and total clean drinking water by country.

We have set up a figure for you to plot on. Use `sns.boxplot()` with the following arguments:

- `x` set to `country`
- `y` set to `value`
- `hue` set to `area`
- `data` set to `df`

As a reminder, `sns.boxplot()` can take the same parameters as `sns.barplot()` in exactly the same way. In fact, the only difference between the code for this box plot and the bar plot above will be `.boxplot()` instead of `.barplot()`.

*Optional: You may set a new color palette if you would like using `sns.set_palette()`.*


In [4]:
# f, ax = plt.subplots(figsize=(16, 10))

# ax.set(ylabel="(%) of population using clean water")
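
For a feel of what each box summarizes, the sketch below draws a box plot from an invented toy DataFrame (hypothetical countries and values) and computes the quartiles behind the boxes with pandas. The box spans the 25th to 75th percentile and the line inside it marks the median.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Invented numbers for illustration only (not WHO figures)
toy = pd.DataFrame({
    "country": ["A"] * 10 + ["B"] * 10,
    "area": (["Rural"] * 5 + ["Urban"] * 5) * 2,
    "value": [50, 52, 54, 56, 58, 90, 91, 92, 93, 94,
              70, 71, 72, 73, 74, 96, 97, 98, 99, 100],
})

f, ax = plt.subplots(figsize=(8, 5))
# Same arguments as the bar plot, different plotting function
sns.boxplot(x="country", y="value", hue="area", data=toy)
ax.set(ylabel="(%) of population using clean water")

# Quartiles per group: the numbers each box is drawn from
quartiles = (
    toy.groupby(["country", "area"])["value"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```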

## Step 6 Update DataFrame

Notice in the plot above that the United States distribution is barely visible. That is because nearly 100% of the population across urban and rural areas in the United States has clean drinking water for all the years the data spans.

Let's take the United States out of the DataFrame and make another box plot so we can focus on the countries with a wider distribution.

To remove any row that has a value of United States of America for country:

Filter the DataFrame `df` and assign it back to the original DataFrame using the syntax:
`df=df[<filter>]` 

In the place of `<filter>` use Pandas to create a filter so that `df` is a copy of itself except for the rows where the value for  `country` is `United States of America`.



Hint: <font color=white> df= df[df["country"] != "value to exclude"]</font>
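
The filtering pattern works in two stages: a comparison on a column produces a boolean Series, and indexing the DataFrame with that Series keeps only the `True` rows. A small sketch with invented data (the countries `A`, `B`, `C` are hypothetical):

```python
import pandas as pd

# Invented numbers for illustration only (not WHO figures)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "B"],
    "value": [75.0, 99.0, 60.0, 99.5],
})

# Comparing a column to a value gives one True/False per row
mask = toy["country"] != "B"
print(mask.tolist())  # [True, False, True, False]

# Indexing with the mask keeps only the rows where it is True
filtered = toy[mask]
print(filtered)
```

Note that the filtered DataFrame keeps the original row labels (0 and 2 here). Chaining `.reset_index(drop=True)` renumbers them if that matters for later steps.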

### Replot 
Now, create the same box plot as above with the revised DataFrame, and add a little style with `sns.set_context()` and `sns.set_palette()`.

- Set the context to `poster`
- Set the palette to `Paired`
- Create a `boxplot()` with:
    - `x` set to `country`
    - `y` set to `value`
    - `hue` set to `area`
    - `data` set to `df`

In [6]:
f, ax = plt.subplots(figsize=(16, 10))
# set context here
# set palette here

ax.set(ylabel="(%) of population using clean water")
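
To see what a context preset changes, `sns.plotting_context()` returns the settings a preset would apply without applying them. The sketch below compares the base font size of two presets and then applies the context and palette before creating a figure, so that the new figure is built with those settings.

```python
import seaborn as sns
from matplotlib import pyplot as plt

# Inspect the presets without applying them
notebook_rc = sns.plotting_context("notebook")
poster_rc = sns.plotting_context("poster")
print(notebook_rc["font.size"], poster_rc["font.size"])

# Apply the presets globally; figures created afterwards pick them up
sns.set_context("poster")
sns.set_palette("Paired")

f, ax = plt.subplots(figsize=(8, 5))
ax.set(ylabel="(%) of population using clean water")
```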

### Notice:
Zimbabwe has the biggest difference in the distributions of the rural and urban categories. Over 90% of the urban population of Zimbabwe has had access to clean drinking water over the years 2000-2015, while less than 60% of the rural population has.


## Step 7 A Different Boxplot View Of The Data 

So far we have been visualizing the distribution of each country for the rural, urban, and total values. Another way we might want to visualize the data is to show the distribution of each of the three areas by country.

At the top of the cell below, set the context of the plot to `poster`.

Then, below the line of code that creates the figure, create a boxplot that shows the distribution of each of the three areas by country, with `area` on the x-axis, `value` on the y-axis, and `hue` set to `country`. `data` will still be equal to `df`.

*Optional: You can also set the palette for a box plot by passing the `palette` argument to `sns.boxplot()`. Try adding `palette="Dark2"` as the last argument in `sns.boxplot()`.*

In [7]:

# you can change the background style easily.
# sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(16, 10))

# box plot goes here


ax.set(ylabel="(%) of population using clean water")
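
Swapping the `x` and `hue` columns regroups the same rows: areas become the x-axis groups and countries become the colored boxes within each group. A minimal sketch on invented data (hypothetical countries and values), passing the palette directly to the plotting call:

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Invented numbers for illustration only (not WHO figures)
toy = pd.DataFrame({
    "country": ["A"] * 6 + ["B"] * 6,
    "area": (["Rural"] * 3 + ["Urban"] * 3) * 2,
    "value": [50.0, 54.0, 58.0, 90.0, 92.0, 94.0,
              70.0, 72.0, 74.0, 96.0, 98.0, 100.0],
})

f, ax = plt.subplots(figsize=(8, 5))
# `area` on the x-axis, one colored box per country inside each area
sns.boxplot(x="area", y="value", hue="country", data=toy, palette="Dark2")
ax.set(ylabel="(%) of population using clean water")
```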

How does this plot visualize the same data differently than the one above?

## Step 8 Violin Plots


Another way to visualize distribution data is with a violin plot. We could make the box plot above into a violin plot simply by changing `.boxplot()` to `.violinplot()`.

Copy and paste the line of code used to create the box plot above into the cell below. Change the line of code so that you create a violin plot, instead of a box plot.

`sns.boxplot(x="area", y = "value", hue = "country", data=df, palette="Set2")`

In [8]:
# uncomment the line of code below to reset the background and remove the grid lines
# sns.set_style("white")
sns.set_context("poster")
f, ax = plt.subplots(figsize=(18, 15))

ax.set(ylabel="(%) of population using clean water")
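
To compare the two plot types directly, the sketch below draws a box plot and a violin plot of the same invented toy data (hypothetical countries and values) side by side. The `ax` argument, which both functions accept, chooses which panel each plot is drawn on.

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Invented numbers for illustration only (not WHO figures)
toy = pd.DataFrame({
    "country": ["A"] * 10 + ["B"] * 10,
    "area": (["Rural"] * 5 + ["Urban"] * 5) * 2,
    "value": [50.0, 52.0, 55.0, 56.0, 58.0, 90.0, 91.0, 92.0, 93.5, 94.0,
              70.0, 71.0, 72.5, 73.0, 74.0, 95.0, 96.0, 97.0, 98.5, 99.0],
})

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(14, 6))

# Identical arguments; only the plotting function differs
shared = dict(x="area", y="value", hue="country", data=toy, palette="Set2")
sns.boxplot(ax=ax_box, **shared)
sns.violinplot(ax=ax_violin, **shared)

ax_box.set(title="boxplot")
ax_violin.set(title="violinplot")
```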

## Step 9 Focus In On Mexico

A violin plot might be more useful if we focus in on the data for one country.

Use Pandas to create a DataFrame named `df_mexico` that is equal to only the rows of `df` where the value in the `country` column is equal to `Mexico`.

Hint: <font color=white>new_dataframe = df[df["country"] == "Target"]</font>

Create a violin plot that visualizes the rural, urban, and total values for just Mexico.  


Hint: <font color=white>country should be on the x-axis, value on the y-axis, hue should be set to area, and data should equal your new DataFrame for just Mexico. </font>

In [9]:
f, ax = plt.subplots(figsize=(16, 10))
sns.set_context("poster")
sns.violinplot(x="country", y = "value", hue = "area", data=df_mexico, palette="Set2")

ax.set(ylabel="(%) of population using clean water")
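
A numeric summary can be a useful companion to a single-country violin plot. The sketch below, on invented data with hypothetical countries `A` and `B`, keeps the rows for one country with an equality filter and then uses `describe()` to list the count, mean, and quartiles for each area.

```python
import pandas as pd

# Invented numbers for illustration only (not WHO figures)
toy = pd.DataFrame({
    "country": ["A"] * 6 + ["B"] * 6,
    "area": ["Rural", "Urban", "Total"] * 4,
    "value": [60.0, 90.0, 80.0, 64.0, 92.0, 83.0,
              40.0, 95.0, 70.0, 44.0, 97.0, 73.0],
})

# Keep only the rows for one country
only_a = toy[toy["country"] == "A"]

# One row of summary statistics per area
summary = only_a.groupby("area")["value"].describe()
print(summary)
```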

## You're done! Congratulations!

### How do you feel?

## Bonus!

We could isolate the data even further and plot just the rural and urban values for Mexico, together in a split violin plot.

To do this we would first use Pandas to filter out the rows of the `df_mexico` DataFrame where `area` is equal to `Total`. It would look like this:

`df_mexico = df_mexico[df_mexico["area"] !="Total"]`

You can try this by running the code in the cell below.

In [10]:
df_mexico = df_mexico[df_mexico["area"] !="Total"]
df_mexico.head()

Next we would use `sns.violinplot()` exactly as we did before, but we would add the argument `split = True`, to create a split violin plot.

Run the code in the cell below to see a split violin plot.

In [11]:
f, ax = plt.subplots(figsize=(16, 10))
sns.set_context("poster")
sns.violinplot(x="country", y = "value", hue = "area", data=df_mexico, split = True, palette="Set2")

ax.set(ylabel="(%) of population using clean water")
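
A split violin draws one half of the violin per `hue` level, so it is suited to a `hue` column with exactly two levels. That is why the `Total` rows are filtered out first. A self-contained sketch on invented data (a single hypothetical country `A` with made-up values):

```python
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

# Invented numbers for illustration only (not WHO figures)
toy = pd.DataFrame({
    "country": ["A"] * 15,
    "area": ["Rural"] * 5 + ["Urban"] * 5 + ["Total"] * 5,
    "value": [60.0, 62.0, 65.0, 67.0, 70.0,
              90.0, 91.0, 93.0, 94.0, 96.0,
              80.0, 81.0, 83.0, 85.0, 87.0],
})

# Drop the third level so that exactly two remain for the split
two_areas = toy[toy["area"] != "Total"]

f, ax = plt.subplots(figsize=(8, 5))
sns.violinplot(x="country", y="value", hue="area", data=two_areas,
               split=True, palette="Set2", ax=ax)
ax.set(ylabel="(%) of population using clean water")
```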

### Bonus Question:
What does the visualization of Mexico's urban vs. rural access to clean drinking water reveal? 

## OK, now you're really, all the way done! Great work!