#Python Packages: What they are and how to use them

I introduced the idea of packages this morning. There are a large number of packages that enable Python to do a variety of things. We'll introduce some basic ideas today and then we'll take a deeper look at data analysis and data visualization tomorrow.

Also, for this one I'm not going to use Google Slides or PowerPoint. I'll just do everything within Colab to keep things simpler.

We'll start with how to do some plotting. We'll use matplotlib.pyplot. The original purpose of the pyplot module was to make Matplotlib more accessible to former MATLAB users. There are other ways of using Matplotlib, but pyplot is the most common.

Parts of this Notebook are adapted from [Plot With Pandas: Python Data Visualization for Beginners](https://realpython.com/pandas-plot-python/#create-your-first-pandas-plot) by Reka Horvath.
There is also material from other sources, but I have followed the structure of this web page for the pandas usage. I am focusing on the aspects that are most useful to ASRI participants. This site is a good resource. We'll use the same data sets as the website.

We'll do direct downloads of data rather than putting them on our Google Drives. This is a good way to get data into Google Colab. I'll also show you how to use your Google Drive to read in downloaded data. I'll do it today if we have time, or tomorrow if we don't.

In [None]:
#We'll begin by importing the pyplot module 
import matplotlib.pyplot as plt
X = range(100)
Y = [value ** 2 for value in X]
plt.plot(X, Y)
plt.show()

Below I use only Python's standard library. We create a range T with the numbers from 0 to 99 (a 100-point curve). We find the x coordinates by rescaling the values in T so that x goes from 0 to just under 2$\pi$, then we generate the y coordinates. And then we do the plot.

In [None]:
import math as mt
T=range(100)
X=[(2*mt.pi*t)/len(T) for t in T]
Y=[mt.sin(value) for value in X]
plt.plot(X, Y)
plt.show()

I could use NumPy to do the same thing. I do exactly that in the codeblock below. I use the function np.linspace to create a one-dimensional array with N evenly spaced entries. You don't specify the spacing. The syntax is:

np.linspace(start,end,num_steps)

NumPy works out the spacing needed to get from the start to the end with the specified number of points. Both endpoints are included.


In [None]:
import numpy as np
X= np.linspace(0, 2*np.pi, 100)
Y=np.sin(X)
plt.plot(X, Y)
plt.show()
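
To see what np.linspace actually produces, here is a minimal sketch with only five points. The variable names are just for illustration. Because both endpoints are included by default, five points give four equal gaps.

```python
import numpy as np

# Five evenly spaced values from 0 to 1; both endpoints are included by default
points = np.linspace(0, 1, 5)
print(points)

# The spacing is (end - start) / (number of points - 1), here 0.25
spacing = np.diff(points)
print(spacing)
```

Try changing the 5 to 11 and see how the spacing changes.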

#Multiple Plots
This time we'll do both sine and cosine. pyplot and numpy are already loaded, but we'll load them again anyway.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
X= np.linspace(0, 2*np.pi, 100)
Ya=np.sin(X)
Yb=np.cos(X)
plt.plot(X, Ya)
plt.plot(X, Yb)
plt.show()


#How to Download Plots

Here, I'll show you the code that you need to download your plots to your computer. We'll use the same code as above and add the lines that enable the download.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
X= np.linspace(0, 2*np.pi, 100)
Ya=np.sin(X)
Yb=np.cos(X)
plt.plot(X, Ya)
plt.plot(X, Yb)
###The next three lines are specific to downloading an image from Google Colab
from google.colab import files
plt.savefig("sinecosine.png")
files.download("sinecosine.png") 
###### These lines must appear before the plt.show() command
plt.show()
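
If you run this outside Colab, the google.colab module is not available, but plt.savefig works on its own. This is a minimal sketch, assuming you only want the file written to the current working directory. The filename and the dpi value are arbitrary choices of mine.

```python
import os
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(0, 2*np.pi, 100)
plt.plot(X, np.sin(X))
plt.plot(X, np.cos(X))

# Save the figure to disk; dpi sets the resolution of the PNG
out_name = "sinecosine_local.png"   # arbitrary filename for this sketch
plt.savefig(out_name, dpi=150)

# Confirm that the file was written
print(os.path.exists(out_name))
```

As in the Colab version, the savefig call goes before any plt.show().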

# Let's plot some bar charts.  
pyplot is already available, but I'll load it again.

In [None]:
import matplotlib.pyplot as plt
data= [5., 25., 50., 40.]
plt.bar(range(len(data)), data) 
plt.show()

There is one bar for each value. We gave the bar function two arguments: the x coordinate of each bar and its height. Here the coordinates are 0, 1, 2, and so on, which is the purpose of range(len(data)). By default a bar has a width of 0.8 units. We can remove the gaps by setting the width equal to one.

In [None]:
import matplotlib.pyplot as plt
data= [5., 25., 50., 20.]
plt.bar(range(len(data)), data, width=1.) 
plt.show()

In [None]:
import matplotlib.pyplot as plt
data= [5., 25., 50., 20.]
plt.bar(range(len(data)), data, width=.97) 
plt.show()
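
If you want to check the default width for yourself, plt.bar returns the bars it drew, and each bar can report its own geometry. This is a small sketch. The names bars and first are mine.

```python
import matplotlib.pyplot as plt

data = [5., 25., 50., 20.]
bars = plt.bar(range(len(data)), data)

# Each bar is a rectangle, centered on its x coordinate by default
first = bars[0]
print(first.get_width())    # width of the bar (the default here)
print(first.get_x())        # left edge of the first bar
print(first.get_height())   # the data value
```

The first bar sits at x = 0, so with the default width its left edge is at -0.4.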

#Horizontal Bars

In [None]:
import matplotlib.pyplot as plt
data= [5., 25., 50., 20.]
plt.barh(range(len(data)), data) 
plt.show()

#Stacked Bar Charts

In [None]:
import matplotlib.pyplot as plt
A=[5., 30., 45., 22.]
B= [5., 25., 50., 20.]

X=range(4)

plt.bar(X, A, color='b') 
plt.bar(X, B, color='r', bottom=A) 
plt.show()

#Pie Charts

In [None]:
import matplotlib.pyplot as plt

data= [5, 25, 50, 20, 30, 10]
plt.pie(data)
plt.show()

# Plotting Histograms

We'll draw 1000 values from a normal distribution and generate a histogram with 20 bins. Run it several times - it will change.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x=np.random.randn(1000)
plt.hist(x, bins=20)

plt.show()
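
The histogram changes on every run because new random values are drawn each time. If you want a repeatable picture, one option is to seed the random number generator. This sketch uses NumPy's default_rng with an arbitrary seed (42 is my choice) and np.histogram to look at the bin counts directly.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)      # arbitrary seed: same values on every run
x = rng.standard_normal(1000)

# np.histogram returns the count in each bin and the bin edges
counts, edges = np.histogram(x, bins=20)
print(counts.sum())                  # every one of the 1000 values lands in a bin
print(len(edges))                    # 20 bins need 21 edges

plt.hist(x, bins=20)
```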

#Labels, colors, etc.
##Colors

There are lots of ways of defining colors in matplotlib; I'm going to give you the simplest. Matplotlib will interpret standard HTML color names as actual colors. Some colors have a one-letter alias you can use instead. I will list these, color followed by alias, since they aren't always obvious.

Blue - b

Green - g

Red - r

Cyan - c

Magenta - m

Yellow - y

Black - k

White - w

Matplotlib will interpret a string representation of a floating point value between 0 and 1 as a shade of gray: '0' is black, '1' is white, and '0.75' is a light gray.


For those of you with some experience of this, matplotlib will also interpret HTML color strings as colors. These strings are defined as #RRGGBB, where RR, GG, and BB are the values of the red, green, and blue components in hexadecimal (8-bit values). You will either understand the previous sentence or not, and I'm not giving an example of it. Instead, let's look at an example that uses a gray shade and a color alias.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
def pdf (x, mu, sigma):
  a=1. / (sigma*np.sqrt(2.*np.pi))
  b=-1. / (2.*sigma**2)
  return a * np.exp(b*(x-mu)**2)

x=np.linspace(-6, 6, 1000)

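for i in range(5):
  samples=np.random.standard_normal(50)
  mu, sigma =np.mean(samples), np.std(samples)
  #gray curve: normal pdf fitted to this batch of 50 samples
  plt.plot(x, pdf(x, mu, sigma), color = '.5')
#yellow curve: the true standard normal pdf, drawn once after the loop
plt.plot(x, pdf(x, 0., 1), color='y')
plt.show()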
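
If you are curious what a color string turns into, matplotlib.colors.to_rgb converts a color specification into its red, green, and blue components. This is a small sketch. The list of strings is my own selection.

```python
from matplotlib import colors

# Each of these strings is a valid Matplotlib color specification
for spec in ["red", "r", "k", "w", "0.5", "0.75"]:
    # to_rgb returns the (red, green, blue) components as floats from 0 to 1
    print(spec, colors.to_rgb(spec))
```

Notice that the gray shades have three equal components, and that a larger number means a lighter gray.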

##Titles

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x=np.linspace(-4, 4, 1024)
y= .25*(x+4.)*(x+1.)*(x-2.)

plt.title('A polynomial')
plt.plot(x,y, c='b')

##LaTeX Titles

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x=np.linspace(-4, 4, 1024)
y= .25*(x+4.)*(x+1.)*(x-2.)

plt.title('$f(x)=\\frac{1}{4}(x+4)(x+1)(x-2)$')
plt.plot(x,y, c='m')

##Labeled axes with a title

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x=np.linspace(-4, 4, 1024)
y= .25*(x+4.)*(x+1.)*(x-2.)

plt.title('Power curve for airfoil')
plt.xlabel ('Air Speed')
plt.ylabel('Total drag')
plt.plot(x,y, c='b')

##Labels on the plot

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x=np.linspace(-4, 4, 1024)
y= .25*(x+4.)*(x+1.)*(x-2.)

plt.title('Power curve for airfoil')
plt.xlabel ('Air Speed')
plt.ylabel('Total drag')
plt.text(0.5,0.25, 'Minimum')
plt.text(-3.8,4., ' Local Maximum')
#the numbers are x and y coordinates of where the label starts
plt.plot(x,y, c='b')
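
In the cell above I picked the label coordinates by eye. As an alternative, you can let NumPy find the turning points on the grid and put the labels there. This is a sketch under my own assumptions: the names i_min and i_max are mine, and I restrict the search for the local maximum to x < 0 because the curve rises again on the right.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 1024)
y = .25*(x+4.)*(x+1.)*(x-2.)

# Index of the lowest point of the curve on this grid
i_min = np.argmin(y)
# Index of the highest point among x < 0 only
i_max = np.argmax(np.where(x < 0, y, -np.inf))

plt.plot(x, y, c='b')
plt.text(x[i_min], y[i_min], 'Minimum')
plt.text(x[i_max], y[i_max], 'Local Maximum')
print(x[i_min], y[i_min])
print(x[i_max], y[i_max])
```

Each label now starts exactly at the point it describes, so you may still want to nudge the coordinates a little to keep the text off the curve.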

##Placing the text in boxes

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x=np.linspace(-4, 4, 1024)
y= .25*(x+4.)*(x+1.)*(x-2.)

plt.title('Power curve for airfoil')
plt.xlabel ('Air Speed')
plt.ylabel('Total drag')

box = {
    'facecolor': '0.95',
    'edgecolor': 'r',
    'boxstyle': 'round'
}
plt.text(0.5,0.25, 'Minimum', bbox=box) 
plt.text(-3.8,4., ' Local Maximum', bbox=box)
#the numbers are x and y coordinates of where the label starts
plt.plot(x,y, c='b')

There is much, much more you can do, but this should be enough for now!

If you need more, [here](https://matplotlib.org/stable/tutorials/introductory/pyplot.html) is the official tutorial

#Now we'll look at plotting from pandas - starting with a way to load a .csv file into Colab

Pandas implements plotting commands that are wrappers around matplotlib commands. One limitation of pandas is that the documentation usually lags behind the preferred usage in the package. This is where I begin adapting [Plot With Pandas: Python Data Visualization for Beginners by Reka Horvath](https://realpython.com/pandas-plot-python/#create-your-first-pandas-plot) into Colab. I've eliminated some of her explanations, and some of her content.

The advantage of this version is that you can change the code and see what happens!

In [None]:
import pandas as pd

download_url = ("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv" )

df = pd.read_csv(download_url)

type(df)

In the above, by calling read_csv(), we create a DataFrame, which is the main data structure used in pandas. I then used the type function to tell me about the object that I downloaded.
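
read_csv() does not care whether the text comes from a URL or from somewhere else. Here is a minimal sketch that reads a few lines of made-up CSV text (not real data; the majors and numbers are placeholders of mine) so you can see the pieces without a download.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a downloaded file (not real data)
csv_text = """Rank,Major,Median
1,MAJOR A,50000
2,MAJOR B,40000
3,MAJOR C,30000
"""

toy_df = pd.read_csv(io.StringIO(csv_text))
print(type(toy_df))
print(toy_df.shape)      # (number of rows, number of columns)
print(toy_df.dtypes)     # pandas infers a type for each column
```

The first line of the text becomes the column names, and each later line becomes one row.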

In [None]:
df.head()

Now try this. It makes no difference this time, but it prevents pandas from hiding columns when a DataFrame is too wide to display in full.

In [None]:
pd.set_option("display.max.columns", None)
df.head()

Here's a plot. .plot() returns a line graph containing data from every row in the DataFrame. The x-axis values represent the rank of each major, and the "P25th", "Median", and "P75th" values are plotted on the y-axis.

In [None]:
df.plot(x="Rank", y=["P25th", "Median", "P75th"])

Looking at the plot, you can make the following observations:

The median income decreases as rank decreases. This is expected because the rank is determined by the median income.

Some majors have large gaps between the 25th and 75th percentiles. People with these degrees may earn significantly less or significantly more than the median income.

Other majors have very small gaps between the 25th and 75th percentiles. People with these degrees earn salaries very close to the median income.

This plot already hints that there’s a lot more to discover in the data! Some majors have a wide range of earnings, and others have a rather narrow range. To discover these differences, you could use several other types of plots.

##Properties of .plot()
.plot() has several optional parameters. Most notably, the kind parameter accepts eleven different string values and determines which kind of plot you’ll create:

"area" is for area plots.

"bar" is for vertical bar charts.

"barh" is for horizontal bar charts.

"box" is for box plots.

"hexbin" is for hexbin plots.

"hist" is for histograms.

"kde" is for kernel density estimate charts.

"density" is an alias for "kde".

"line" is for line graphs.

"pie" is for pie charts.

"scatter" is for scatter plots.

The default value is "line". Line graphs, like the one created above, provide a good overview of your data. You can use them to detect general trends. They rarely provide sophisticated insight, but they can give you clues as to where to zoom in.

If you don’t provide a parameter to .plot(), then it creates a line plot with the index on the x-axis and all the numeric columns on the y-axis.

As an alternative to passing strings to the kind parameter of .plot(), DataFrame objects have several methods that you can use to create the various kinds of plots described above. You reach them through the .plot accessor, for example df.plot.bar():

.plot.area() .plot.bar() .plot.barh() .plot.box() .plot.hexbin() .plot.hist() .plot.kde() .plot.density() .plot.line() .plot.pie() .plot.scatter()
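
To compare the two spellings, here is a small sketch on a made-up DataFrame (the column names and values are placeholders of mine). Both calls ask for the same bar chart.

```python
import pandas as pd

# Made-up values, only to have something to draw
toy_df = pd.DataFrame({"label": ["a", "b", "c"], "value": [3, 1, 2]})

# Two spellings of the same request; each call returns a Matplotlib Axes
ax1 = toy_df.plot(x="label", y="value", kind="bar")
ax2 = toy_df.plot.bar(x="label", y="value")

print(type(ax1))
print(len(ax1.patches), len(ax2.patches))   # one rectangle per row in each plot
```

Which spelling you use is a matter of taste. The kind= form is handy when the plot type is stored in a variable.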

When you call .plot() on a DataFrame object, Matplotlib creates the plot under the hood.

To illustrate this, first, create a plot with Matplotlib using two columns of your DataFrame:

In [None]:
import matplotlib.pyplot as plt

plt.plot(df["Rank"], df["P75th"])

We can create the same graph using the DataFrame object’s .plot() method. .plot() is a wrapper for pyplot.plot(), and the result shows the same line as the one you produced with Matplotlib (pandas also adds a legend and an x-axis label).

You can use both pyplot.plot() and df.plot() to produce the same graph from columns of a DataFrame object. However, if you already have a DataFrame instance, then df.plot() offers cleaner syntax than pyplot.plot().

In [None]:
df.plot(x="Rank", y="P75th")
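
If you want to convince yourself that the two routes draw the same line, you can compare the data stored in the two plots. This sketch uses a made-up DataFrame with the same column names, and it uses plt.subplots() so that I can hold on to the Matplotlib axes. Both of those are my choices for the illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values, only for the comparison
toy_df = pd.DataFrame({"Rank": [1, 2, 3, 4], "P75th": [90, 70, 65, 40]})

fig, ax_mpl = plt.subplots()
ax_mpl.plot(toy_df["Rank"], toy_df["P75th"])      # Matplotlib directly

ax_pd = toy_df.plot(x="Rank", y="P75th")          # pandas wrapper

# Each axes holds one line; compare the y values they store
y_mpl = np.asarray(ax_mpl.lines[0].get_ydata())
y_pd = np.asarray(ax_pd.lines[0].get_ydata())
same_y = np.array_equal(y_mpl, y_pd)
print(same_y)
```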

# Using your plots to explore datasets

Next we'll get a general overview of a specific column of your dataset. First, we’ll have a look at the distribution of a property with a histogram. Then we’ll introduce some tools to examine the outliers.

##Distributions and Histograms
DataFrame is not the only class in pandas with a .plot() method. The Series object provides similar functionality.

You can get each column of a DataFrame as a Series object. Here’s an example using the "Median" column of the DataFrame created from the college major data.

In [None]:
median_column = df["Median"]
type(median_column)

pandas.core.series.Series

Now that you have a Series object, you can create a plot for it. A histogram is a good way to visualize how values are distributed across a dataset. Histograms group values into bins and display a count of the data points whose values are in a particular bin.

To plot a histogram for the "Median" column, you call .plot() on the median_column Series and pass the string "hist" to the kind parameter.

The histogram shows the data grouped into ten bins spanning roughly $20,000 to $120,000, so each bin is roughly $10,000 wide. The histogram has a different shape than the normal distribution which we used to introduce the histogram. The normal distribution has a symmetric bell shape with a peak in the middle.

In [None]:
median_column.plot(kind="hist")
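
The ten bins come from splitting the range between the smallest and largest value into equal widths. Here is a small sketch of that calculation with np.histogram, on a made-up Series (the numbers are placeholders of mine, not the salary data).

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a numeric column
toy = pd.Series([22, 30, 35, 36, 40, 45, 52, 60, 75, 110])

# Ten equal-width bins between the minimum and the maximum,
# which is also the default number of bins for a pandas histogram
counts, edges = np.histogram(toy, bins=10)
print(edges)             # 11 edges from toy.min() to toy.max()
print(np.diff(edges))    # every bin has width (max - min) / 10
print(counts)
```

You can pass bins=20 (or any other number) to .plot(kind="hist") if ten bins are too coarse.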


##Outliers
There is an outlier, a bin on the right edge of the distribution. It seems that one data point has its own category. The majors in this field get an excellent salary compared not only to the average but also to the runner-up. Although this isn’t its main purpose, a histogram can help you to detect such an outlier. Let’s investigate the outlier a bit more.

Which majors does this outlier represent? How big is its edge? Contrary to the first overview, you only want to compare a few data points, but you want to see more details about them. For this, a bar plot is an excellent tool. First, select the five majors with the highest median earnings. You’ll need two steps:

To sort by the "Median" column, use .sort_values() and provide the name of the column you want to sort by as well as the direction ascending=False.

To get the top five items of your list, use .head(). With no argument, .head() returns the first five rows.

We'll create a dataframe called top_5

In [None]:
top_5 = df.sort_values(by="Median", ascending=False).head()
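
Here are the same two steps on a made-up DataFrame that is small enough to check by eye (the values are placeholders of mine). I also show nlargest(), which is a pandas method that does both steps in one call.

```python
import pandas as pd

# Made-up values, only to show the two steps
toy_df = pd.DataFrame({
    "Major": ["A", "B", "C", "D", "E", "F"],
    "Median": [30, 80, 50, 110, 40, 60],
})

# Step 1: sort from highest to lowest; step 2: keep the first three rows
top_3 = toy_df.sort_values(by="Median", ascending=False).head(3)
print(top_3)

# nlargest does the sorting and the selection in one call
top_3_alt = toy_df.nlargest(3, "Median")
print(top_3_alt)
```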

As a next step we create a bar plot that shows only the majors with these top five median salaries.

In [None]:
top_5.plot(x="Major", y="Median", kind="bar", rot=80, fontsize=10)

This plot shows that the median salary of petroleum engineering majors is more than $20,000 higher than the rest. The earnings for the second- through fourth-place majors are relatively close to one another.

If you have a data point with a much higher or lower value than the rest, then you’ll probably want to investigate a bit further. For example, you can look at the columns that contain related data.

Let’s investigate all majors whose median salary is above $60,000. First, you need to filter these majors with the mask df[df["Median"] > 60000]. Then you can create another bar plot showing all three earnings columns.

In [None]:
top_medians = df[df["Median"] > 60000].sort_values("Median")
top_medians.plot(x="Major", y=["P25th", "Median", "P75th"], kind="bar")
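
If the mask looks mysterious, it helps to look at it in two steps. This sketch uses a made-up DataFrame (the values are placeholders of mine).

```python
import pandas as pd

# Made-up values, only to show how the mask works
toy_df = pd.DataFrame({
    "Major": ["A", "B", "C", "D"],
    "Median": [45000, 62000, 75000, 58000],
})

# The comparison gives one True/False value per row
mask = toy_df["Median"] > 60000
print(mask.tolist())

# Indexing with the mask keeps only the rows marked True
filtered = toy_df[mask]
print(filtered)
```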

##Investigating outliers is an important step in data cleaning.

##Check for correlation

Often you want to see whether two columns of a dataset are connected. If you pick a major with higher median earnings, do you also have a lower chance of unemployment? As a first step, create a scatter plot with those two columns.

In [None]:
df.plot(x="Median", y="Unemployment_rate", kind="scatter")

A quick glance at this figure shows that there’s no significant correlation between the earnings and unemployment rate.

A scatter plot is an excellent tool for getting a first impression about possible correlation, but it certainly isn’t definitive proof of a connection. For an overview of the correlations between different columns, you can use .corr(). If you suspect a correlation between two values, then you have several tools at your disposal to verify your hunch and measure how strong the correlation is.

Keep in mind, though, that even if a correlation exists between two values, it still doesn’t mean that a change in one would result in a change in the other. In other words, correlation does not imply causation.
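
Here is a minimal sketch of .corr() on made-up columns where I know the answer in advance (the values are placeholders of mine): "b" rises in step with "a", and "c" falls in step with "a".

```python
import pandas as pd

# Made-up values: "b" rises with "a", "c" falls as "a" rises
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [10, 8, 6, 4, 2],
})

# Correlation between two columns, a number from -1 to 1
r_ab = toy_df["a"].corr(toy_df["b"])
r_ac = toy_df["a"].corr(toy_df["c"])
print(r_ab, r_ac)

# Correlation matrix for every pair of numeric columns
print(toy_df.corr())
```

On the college major data, the same idea would be df["Median"].corr(df["Unemployment_rate"]).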

#Analyze Categorical Data


Many datasets contain some explicit or implicit categorization. In the current example, the 173 majors are divided into 16 categories.

##Grouping
A basic usage of categories is grouping and aggregation. You can use .groupby() to determine how popular each of the categories in the college major dataset is.

In [None]:
cat_totals = df.groupby("Major_category")["Total"].sum().sort_values()
cat_totals

With .groupby(), you create a DataFrameGroupBy object. With .sum(), you create a Series.
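
To see the intermediate objects for yourself, here is the same chain split into steps on a made-up DataFrame (the categories and totals are placeholders of mine).

```python
import pandas as pd

# Made-up values, only to show the grouping steps
toy_df = pd.DataFrame({
    "Major_category": ["X", "Y", "X", "Y", "Z"],
    "Total": [10, 20, 30, 40, 5],
})

grouped = toy_df.groupby("Major_category")      # a DataFrameGroupBy object
totals = grouped["Total"].sum().sort_values()   # a Series indexed by category

print(type(grouped).__name__)
print(type(totals).__name__)
print(totals)
```

The category names become the index of the Series, which is why they show up as the bar labels when you plot it.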

We'll draw a horizontal bar plot showing all the category totals in cat_totals

In [None]:
cat_totals.plot(kind="barh", fontsize=10)

##Determining Ratios
Vertical and horizontal bar charts are often a good choice if you want to see the difference between your categories. If you’re interested in ratios, then pie plots are an excellent tool. However, since cat_totals contains a few smaller categories, creating a pie plot with cat_totals.plot(kind="pie") will produce several tiny slices with overlapping labels.

To address this problem, you can lump the smaller categories into a single group. Merge all categories with a total under 100,000 into a category called "Other", then create a pie plot. Notice that we include the argument label="". By default, pandas adds a label with the column name. That often makes sense, but in this case it would only add noise.

In [None]:
small_cat_totals = cat_totals[cat_totals < 100_000]
big_cat_totals = cat_totals[cat_totals > 100_000]
# Adding a new item "Other" with the sum of the small categories
small_sums = pd.Series([small_cat_totals.sum()], index=["Other"])
# Series.append() was removed in pandas 2.0, so we use pd.concat() instead
big_cat_totals = pd.concat([big_cat_totals, small_sums])
big_cat_totals.plot(kind="pie", label="")

#Zooming in on Categories
Sometimes we also want to verify whether a certain categorization makes sense. Are the members of a category more similar to one another than they are to the rest of the dataset? Again, a distribution is a good tool to get a first overview. Generally, we expect the distribution of a category to be similar to the normal distribution but have a smaller range.

Let's create a histogram plot showing the distribution of the median earnings for the engineering majors. We get a histogram that you can compare to the histogram of all majors from the beginning. The range of the major median earnings is somewhat smaller, starting at $40,000. The distribution is closer to normal, although its peak is still on the left. So, even if you’ve decided to pick a major in the engineering category, it would be wise to dive deeper and analyze your options more thoroughly.

In [None]:
df[df["Major_category"] == "Engineering"]["Median"].plot(kind="hist")