
## Diamonds Data

This data set contains the prices and other attributes of almost 54,000 diamonds. It is available on GitHub in the [2014_data repository](https://github.com/cs109/2014_data) as `diamonds.csv`.


## Reading in the diamonds data (CSV file) from the web

Because this is a `.csv` file, we will use the pandas function `read_csv()`, which reads a CSV file into a pandas DataFrame.
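As a minimal sketch of the call, the snippet below reads a small in-memory CSV so that it runs without network access. The raw-file URL is an assumption inferred from the repository link (it is not given in the text), and the three sample rows are illustrative only, not the real file.

```python
import io
import pandas as pd

# Assumed raw-file location, inferred from the repository link; verify before use.
DIAMONDS_URL = "https://raw.githubusercontent.com/cs109/2014_data/master/diamonds.csv"

# Small illustrative sample (not the real file) so the example runs offline.
sample_csv = io.StringIO(
    "carat,cut,color,clarity,price\n"
    "0.23,Ideal,E,SI2,326\n"
    "0.21,Premium,E,SI1,326\n"
    "0.29,Premium,I,VS2,334\n"
)

# read_csv accepts a URL, a local path, or a file-like object.
sample = pd.read_csv(sample_csv)
print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # column types inferred by pandas

# With network access, the full data set would be read the same way:
# diamonds = pd.read_csv(DIAMONDS_URL)
```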

The table below describes each column.

Column name | Description 
--- | --- 
carat | weight of the diamond (0.2–5.01)
cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color | diamond color, from J (worst) to D (best)
clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
table | width of top of diamond relative to widest point (43–95)
price | price in US dollars (\$326–\$18,823)
x | length in mm (0–10.74)
y | width in mm (0–58.9)
z | depth in mm (0–31.8)

## Reviewing summaries in Pandas

We just learned about `diamonds.describe()` above. What else can we do?
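A few other summary methods are sketched here on a small hypothetical DataFrame that stands in for `diamonds` (the values are made up); the same calls should apply to the full data set.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds DataFrame (values are made up).
diamonds = pd.DataFrame({
    "carat": [0.3, 0.7, 1.0, 1.5, 0.4, 1.2],
    "cut": ["Ideal", "Good", "Ideal", "Premium", "Good", "Ideal"],
    "price": [400, 2100, 5000, 9800, 700, 6400],
})

print(diamonds.head(3))                 # first three rows
print(diamonds.dtypes)                  # type of each column
print(diamonds["cut"].value_counts())   # number of rows per category
print(diamonds["price"].median())       # a single summary statistic

# Mean price within each cut category.
mean_price_by_cut = diamonds.groupby("cut")["price"].mean()
print(mean_price_by_cut)
```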

## Defining your own functions

New functions can be defined using a built-in keyword in Python: `def`.  

The first line of the function (the header) must start with the keyword `def`, followed by the name of the function (which can contain underscores), parentheses (with any arguments inside them), and a colon. When the function is called, arguments passed by keyword can be given in any order.

The rest of the function (the body) must be indented consistently; four spaces is the convention. If you define a function in interactive mode, the interpreter will print an ellipsis (`...`) to let you know the function isn't complete. To complete the function, enter an empty line (not necessary in a script).

To return a value from a function, use `return`. The function will immediately terminate and not run any code written past this point.
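The rules above can be sketched with two small functions. The name `squared` matches the function referred to in the lambda section; `power` is a hypothetical second example added to show a default value and keyword arguments.

```python
def squared(x):
    # The function ends at `return`; any code after it would not run.
    return x ** 2


def power(base, exponent=2):
    # `exponent` has a default value, so it may be omitted in a call.
    return base ** exponent


print(squared(4))                  # 16
print(power(2, 3))                 # 8
print(power(exponent=3, base=2))   # 8, keyword arguments in any order
```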

#### The docstring
When defining new functions, you can add a `docstring` (i.e. the documentation of the function) at the beginning of the function that describes what the function does. The docstring is a triple-quoted (possibly multi-line) string. We highly recommend documenting the functions you define, as it is good Python coding practice.
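For illustration, here is one way a docstring might look on a `squared` function; the wording of the docstring itself is only an example.

```python
def squared(x):
    """Return the square of x.

    x can be any object that supports the ** operator, such as an int or float.
    """
    return x ** 2


# The docstring is stored on the function object.
print(squared.__doc__)
# In an interactive session, help(squared) displays the same text.
```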

#### Lambda functions
Lambda functions are one-line, anonymous functions. When you define a function with the `lambda` keyword, you do not need the `return` keyword: the value of the expression is returned automatically. For example, we can rewrite the `squared()` function above using the following syntax:
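A minimal sketch, repeating a plain definition of `squared` for comparison; the names `squared_lambda` and `pairs` are hypothetical.

```python
# Regular definition, as a reference point.
def squared(x):
    return x ** 2

# Equivalent lambda: the expression after the colon is the return value.
squared_lambda = lambda x: x ** 2

print(squared(5), squared_lambda(5))

# Lambdas are most often passed directly to another function,
# here as the sort key (sort by the second item of each pair).
pairs = [("b", 3), ("a", 1), ("c", 2)]
sorted_pairs = sorted(pairs, key=lambda pair: pair[1])
print(sorted_pairs)
```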

## For loops and while loops

#### For loops
Defining a `for` loop is similar to defining a new function: the header ends with a colon and the body is indented, by convention with four spaces. The function `range(n)` takes an integer n and produces the sequence of integers from 0 to n - 1. `for` loops are not just for counters; they can iterate through many types of objects, such as strings, lists, and dictionaries.
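The following sketch shows each kind of iteration mentioned above. The list and dictionary contents are made-up examples.

```python
# range(n) yields the integers 0, 1, ..., n - 1.
counts = []
for i in range(4):
    counts.append(i)
print(counts)

# Iterating over a string visits one character at a time.
for letter in "cut":
    print(letter)

# Iterating over a list visits each element in order.
for cut in ["Fair", "Good", "Ideal"]:
    print(cut)

# Iterating over a dictionary visits its keys; .items() gives key-value pairs.
prices = {"Fair": 4300, "Ideal": 3400}  # made-up numbers
for cut, price in prices.items():
    print(cut, price)
```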

To traverse all characters in a given string, you can use a `for` or `while` loop. Here we create the names of the duck statues in the Public Garden in downtown Boston: Jack, Kack, Lack, Mack, Nack, Oack, Pack, Qack.
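One way to generate these names is to attach each letter of a prefix string to a common suffix. The variable names `prefixes`, `suffix` and `duck_names` are assumptions made for this sketch.

```python
prefixes = "JKLMNOPQ"
suffix = "ack"

# for loop: visit each character of the string in turn.
duck_names = []
for letter in prefixes:
    duck_names.append(letter + suffix)
print(duck_names)

# while loop: the same traversal using an explicit index.
index = 0
while index < len(prefixes):
    print(prefixes[index] + suffix)
    index += 1
```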

#### While loops
Defining a `while` loop is again similar to defining a `for` loop or new function. The header ends with a colon and the body is indented with four spaces. 
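A small countdown illustrates the structure; the variable names are hypothetical.

```python
# Count down from 3; the condition is checked before every pass.
n = 3
steps = []
while n > 0:
    steps.append(n)
    n -= 1      # without this update the loop would never end
print(steps)    # [3, 2, 1]
```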

#### List comprehensions
Another powerful feature of Python is the **list comprehension**, which maps one list onto another by applying an expression to each element. For example, we can take each element in a list `a` (temporarily assigning it to the name `i`) and square it. This creates a new list and does not modify `a`. A comprehension can also include a condition, for example to square only the elements that are not equal to 10.
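A short sketch of both forms; the contents of `a` are an assumption chosen so that one element equals 10.

```python
a = [5, 10, 15, 20]

# Square every element of a.
squares = [i ** 2 for i in a]

# Square only the elements that are not equal to 10.
squares_not_ten = [i ** 2 for i in a if i != 10]

print(squares)
print(squares_not_ten)
print(a)   # the original list is unchanged
```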

## Exploratory Data Analysis (EDA)

The variables `carat` and `price` are both continuous variables, while `color` and `clarity` are discrete variables. First, let's look at some summary statistics of the diamonds data set. 

Let's look at the distribution of carats and price using a histogram. pandas has a `hist()` method that can be used with any Series or DataFrame. You can set the number of bins and the color.
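A minimal sketch of such a histogram follows. To keep it self-contained it uses a synthetic stand-in for `diamonds` (random values, not the real data); with the real DataFrame only the last two lines would be needed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the diamonds DataFrame (not the real data).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)
diamonds = pd.DataFrame({
    "carat": carat,
    "price": 326 + 4000 * carat ** 2 * rng.uniform(0.8, 1.2, size=500),
})

# Series.hist draws the histogram and returns the matplotlib axes.
ax = diamonds["carat"].hist(bins=20, color="black")
plt.show()
```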

You can also change the bins and figure size as well as add titles and labels.  Note that the `plt.___` commands are from the matplotlib library which you imported above as 

```
import matplotlib.pyplot as plt
```

This imports the pyplot module of the matplotlib package under the alias `plt`. To access functions within pyplot, you can now use the alias `plt`. Also note how well pandas works with matplotlib.
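As a sketch of how the two libraries combine, the histogram below is drawn by pandas and then labelled with `plt` functions. The data are a synthetic stand-in for `diamonds`, and the title and label text are only examples.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the diamonds DataFrame (not the real data).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)
diamonds = pd.DataFrame({
    "carat": carat,
    "price": 326 + 4000 * carat ** 2 * rng.uniform(0.8, 1.2, size=500),
})

# Create a figure of a chosen size, then let pandas draw on its axes.
fig, ax = plt.subplots(figsize=(8, 4))
diamonds["price"].hist(bins=50, color="gray", ax=ax)

# plt functions act on the current axes, which is the one just drawn on.
plt.title("Distribution of diamond prices")
plt.xlabel("Price (USD)")
plt.ylabel("Number of diamonds")
plt.show()
```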



One can also plot the density of a distribution in pandas.  More documentation on density plots can be found [here](www.google.com).  

Let's plot the density of the price of the diamonds.
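A possible version of this plot is sketched below on a synthetic stand-in for `diamonds`. Note that pandas relies on SciPy to estimate the density, so this assumes SciPy is installed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the diamonds DataFrame (not the real data).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)
diamonds = pd.DataFrame({
    "carat": carat,
    "price": 326 + 4000 * carat ** 2 * rng.uniform(0.8, 1.2, size=500),
})

# kind="density" draws a kernel density estimate (requires SciPy).
ax = diamonds["price"].plot(kind="density")
plt.xlabel("price")
plt.show()
```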

To change the type of plot of a pandas object, change the value of the `kind` argument. Let's look at the relationship between the price of a diamond and its weight in carats. Try changing `alpha` (which ranges from 0 to 1) to control overplotting. What changes?
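The sketch below draws such a scatter plot with `kind="scatter"` on a synthetic stand-in for `diamonds`; the value `alpha=0.3` is an arbitrary starting point.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the diamonds DataFrame (not the real data).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)
diamonds = pd.DataFrame({
    "carat": carat,
    "price": 326 + 4000 * carat ** 2 * rng.uniform(0.8, 1.2, size=500),
})

# Lower alpha makes points more transparent, so dense regions look darker.
ax = diamonds.plot(kind="scatter", x="carat", y="price", alpha=0.3)
plt.show()
```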

We can also create a scatter plot using matplotlib.pyplot instead of pandas directly.
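With matplotlib alone, the same kind of plot might look like this (again on a synthetic stand-in for `diamonds`):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the diamonds DataFrame (not the real data).
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)
diamonds = pd.DataFrame({
    "carat": carat,
    "price": 326 + 4000 * carat ** 2 * rng.uniform(0.8, 1.2, size=500),
})

# plt.scatter takes the x and y values directly; labels are set by hand.
points = plt.scatter(diamonds["carat"], diamonds["price"], alpha=0.3)
plt.xlabel("carat")
plt.ylabel("price")
plt.show()
```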

Let's look at the scatter plots of `price` and `carat` but grouped by color.  
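One possible approach, sketched here on synthetic data with randomly assigned color grades, is to loop over the groups returned by `groupby` and draw each group on the same axes. This is an illustration, not necessarily how the original plot was produced.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the diamonds DataFrame (not the real data);
# color grades are assigned at random.
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)
diamonds = pd.DataFrame({
    "carat": carat,
    "price": 326 + 4000 * carat ** 2 * rng.uniform(0.8, 1.2, size=500),
    "color": rng.choice(list("DEFGHIJ"), size=500),
})

# Each group is drawn as its own scatter layer with its own legend entry.
fig, ax = plt.subplots()
for color_grade, group in diamonds.groupby("color"):
    ax.scatter(group["carat"], group["price"], alpha=0.4, label=color_grade)
ax.set_xlabel("carat")
ax.set_ylabel("price")
ax.legend(title="color")
plt.show()
```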

What happens if you look at the scatter plots of `price` and `carat` but grouped by `clarity`?

We could also look at boxplots of the `price` grouped by `color`.  
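A sketch of grouped boxplots using the `by` argument of `DataFrame.boxplot`, on synthetic data with randomly assigned color grades:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the diamonds DataFrame (not the real data);
# color grades are assigned at random.
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 2.5, size=500)
diamonds = pd.DataFrame({
    "carat": carat,
    "price": 326 + 4000 * carat ** 2 * rng.uniform(0.8, 1.2, size=500),
    "color": rng.choice(list("DEFGHIJ"), size=500),
})

# One box per color grade, all drawn on the same axes.
fig, ax = plt.subplots(figsize=(8, 4))
diamonds.boxplot(column="price", by="color", ax=ax)
ax.set_ylabel("price")
plt.show()
```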