<a href="https://colab.research.google.com/github/GuillermoFidalgo/Python-for-STEM-Teachers-Workshop/blob/master/notebooks/15-Intro%20to%20Data_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Visualization**
Now we shall discuss how to make the following types of plots with Python:
1. line graph
2. scatter plot
3. histograms

We will use the matplotlib library to import the functions needed to plot our graphs. We can do so in two ways... 

1. Specifically importing the functions you want
```python
from matplotlib.pyplot import plot, show
#later in the code
plot()
show()
```


2. Import the whole module under a short name and call its functions through that name
```python
import matplotlib.pyplot as plt
#later in the code
plt.plot()
plt.show()
```
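
Both ways give you exactly the same functions; only the name you type changes. A quick check, assuming matplotlib is installed (the variable names `same_plot` and `same_show` are just for this illustration):

```python
import matplotlib.pyplot as plt
from matplotlib.pyplot import plot, show

# Both names point to the very same function objects
same_plot = plot is plt.plot
same_show = show is plt.show
print(same_plot, same_show)
```

The code cells in this notebook use the second way, with the `plt.` prefix.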

 

## Line Graph
Useful when we want to see how a dependent variable (some value) changes with respect to another.

To make more detailed plots, we can add arguments to the **plot()** function and use more functions from the matplotlib library.

* We can give **plot()** two arguments!
  * The first argument will be the x-axis values and the second, the y-axis values


* We can change the color of the line and how the graph is drawn by adding another argument to plot() like so:
```python
plot(x,y,"[letter][symbol]")
```
    - we write a **letter** to indicate the color and a **symbol** to indicate the curve style.
  
      The letter (color) options are:
        1. r (red)
        2. g (green)
        3. b (blue); it's the default
        4. c (cyan)
        5. m (magenta)
        6. y (yellow)
        7. k (black)
        8. w (white)

      Some options for curve style are:
        1. "-" (solid line); it's the default 
        2. --  (dashed lines)
        3. o   (circle-points, but not connected with lines)
        4. .   (smaller-circle-points, not connected with lines)
        5. s   (square-marks)

* By calling other functions, we can also:
  * give the figure a title
```python
plt.title("A title")
```
  * give the axes labels
    - import the **xlabel()** and **ylabel()** functions from matplotlib and write:
```python
plt.xlabel("this is a label", fontsize=12)  # fontsize takes a number
```
  * set axes limits
    - import the **xlim()** and **ylim()** functions from matplotlib and write:
```python
plt.xlim(0, 10)  # (start, end)
```
    - A cool detail: to invert an axis, just give a starting number that is larger than the ending number
  * add a legend
    - import **legend** from matplotlib and write:
```python
plt.legend(['data1','data2'], loc="upper right")  # loc sets the location
```
    - Examples for loc = "upper right", "lower left", "center", etc. 
  * change the figure's settings by calling the **figure** function:
    - It is important to note that **plt.figure** goes before **plt.plot()**
    - size: 
```python
plt.figure(figsize=(6.0, 4.0))  # (width, height) in inches
```
    - change background's color
```python
plt.figure(facecolor="k")
```  
    - change edge's color
```python
plt.figure(edgecolor="g") 
```
    - change the resolution in dots-per-inch
```python
plt.figure(dpi=100)
```
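
The options in this list can be combined in one short script. The following is a minimal sketch with made-up values (the data, title, and legend text are only for illustration), showing two format strings, axis labels, an inverted x-axis, and a legend:

```python
import matplotlib.pyplot as plt

# Made-up values, only for illustration
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.figure(figsize=(4, 3), dpi=100)   # figure settings go before plot()
plt.plot(x, y, "g--")                 # g = green, -- = dashed line
plt.plot(x, y, "ko")                  # k = black, o = unconnected circles
plt.title("Squares")
plt.xlabel("x", fontsize=12)
plt.ylabel("x squared", fontsize=12)
plt.xlim(6, 0)                        # start > end, so the x-axis is inverted
plt.legend(["dashed line", "points"], loc="upper right")

x_limits = plt.gca().get_xlim()       # read the limits back to check them
plt.show()
```

Calling **plot()** twice before **show()** draws both curves on the same figure, which is why the legend takes two names.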

In [None]:
# Making a better graph
import matplotlib.pyplot as plt
 
# Our data
x = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
y = [0,3,5,5,6,6,14,21,23,31,39,51,64,79,100,127,174,239,286,316,378,452,475,513,573,620,683,725,788,897,903]
 
# Change the figure's look
plt.figure(figsize=(4,3), dpi=200)
 
# Plot the data
plt.plot(x,y,"r.")
 
# Edit what you like...
# Figure title and axes labels
plt.title("CoronaVirus in Puerto Rico")
plt.xlabel("May 2020")
plt.ylabel("People sick")
plt.legend(['People sick'])
 
plt.show()

## Scatter Plots
Useful to see how two variables relate to each other.


* You have to call the **scatter** function from matplotlib
```python
import matplotlib.pyplot as plt
#later...
plt.scatter(x,y)
plt.show()
```
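* **scatter** does not take the "[letter][symbol]" style argument, since it always draws unconnected points; use keyword arguments such as **c** (color), **s** (marker size) and **marker** (shape) instead
* You can use the same **plt.figure()** arguments to change your graph's look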
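
Since **scatter** has no style string, the look of the points is set with keyword arguments. A minimal sketch with made-up values (the data is only for illustration):

```python
import matplotlib.pyplot as plt

# Made-up values, only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

plt.figure(figsize=(4, 3), dpi=100)
# c sets the color, s the marker size (in points squared), marker the shape
points = plt.scatter(x, y, c="r", s=20, marker="s")
plt.show()
```

Here `c="r"` and `marker="s"` reuse the same color letter and square symbol listed for **plot()**.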


## Loading data from Google Drive



For the next plot, we shall load a dataset that is stored in a Google Drive folder. For it, we will use the **drive** module from the **google.colab** library.


Additionally, the **loadtxt** function from the **numpy** library will be used to load the text file we shall use.

The code would go like this...
```python
from google.colab import drive
 
# Downloading the data from Google Drive
drive.mount('/content/drive')
data_in_drive ="/content/drive/MyDrive/Colab Notebooks/STEM Workshop/Feb 2021/Data/stars.txt"
```

Mounting the drive will ask you to sign in with a Google account and authorize access; older versions of Colab give you an authorization code to paste back into the notebook.

Let's plot...

In [None]:
# Import PyDrive and associated libraries.
# This only needs to be done once per notebook.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once per notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Download a file based on its file ID.
#
# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz
file_id = '1a0RvN4vy__iqejITbRokfGHe0PkqxwpC'
downloaded = drive.CreateFile({'id': file_id})
# print('Downloaded content "{}"'.format(downloaded.GetContentString()))

In [None]:
title=downloaded['title']
downloaded.GetContentFile(title)

## Loading data with pandas or numpy

In [None]:
# For plotting
import matplotlib.pyplot as plt
# For loading the data we'll plot
from numpy import loadtxt


# If it doesn't work, change the path below to the appropriate place for you
path ="/content/stars.txt"
# Or
# path='https://raw.githubusercontent.com/GuillermoFidalgo/Matplotlib-SWC/main/data/stars.txt'

# Loads the data as a NumPy array of float values
data = loadtxt(path, float)
 
# Slice to assign the variable everything from the first column
# This is the stars' temperature
T = data[:,0]   
 
# Slice to assign the variable everything from the second column
# This is the stars' absolute magnitude
M = data[:,1]   
 
# Setting the background
plt.style.use('dark_background')
 
# Change the figure's look
plt.figure(figsize=(6,4), dpi=160)
 
# Plot the data
plt.scatter(T, M, s=3, cmap = "RdYlBu", c=T)
 
plt.title("The Hertzsprung-Russell Diagram", fontsize=8)
plt.xlabel("Temperature")
plt.ylabel("Magnitude")
plt.xlim(13500,0)
plt.ylim(20,-5)
 
plt.show()
plt.style.use('default')
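
The column slicing used in the cell above can be tried without the real file. This is a minimal sketch that assumes the data file is laid out as two whitespace-separated columns, as the slicing implies; the numbers and the name `fake_file` are made up for illustration and are not real star data:

```python
from io import StringIO
from numpy import loadtxt

# A tiny stand-in for stars.txt: two columns separated by spaces.
# These numbers are made up for illustration; they are not real star data.
fake_file = StringIO("5000 4.5\n6000 3.0\n10000 1.2\n")

data = loadtxt(fake_file, float)   # 2D array: one row per line of text
T = data[:, 0]                     # every row, first column
M = data[:, 1]                     # every row, second column
print(data.shape)
print(T)
print(M)
```

In `data[:, 0]` the `:` means "all rows" and the `0` picks the first column, so each slice is a one-dimensional array with one value per line of the file.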

## Histograms
Useful to study the distribution of your data.
* Can be created by calling the **hist()** function from matplotlib
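  * Its first argument is the sequence of values to plot
  * The second argument, **bins**, sets the binning
    * Binning divides the range of your data into "buckets" and counts how many values fall into each one
    * If **bins** is an integer, it is the number of equal-width buckets; it can also be a sequence of bucket edges
    * Example: say you have data that ranges from 0 to 10. If you ask for 5 bins, each "bucket" is 2 wide...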
 
      (0-2), (2-4), (4-6), (6-8), (8-10)
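
The 0-to-10 example can be checked directly, because **hist()** returns the counts and the bucket edges it used. A minimal sketch with made-up values (the list `values` is only for illustration), asking for 5 buckets over the range 0 to 10:

```python
import matplotlib.pyplot as plt

# Made-up values between 0 and 10, only for illustration
values = [0.5, 1, 3, 3.5, 4.2, 5, 7, 7.1, 7.9, 9]

# hist() returns the count in each bucket and the bucket edges
counts, edges, patches = plt.hist(values, bins=5, range=(0, 10))
print(counts)   # how many values fell into each bucket
print(edges)    # 6 edges mark out 5 buckets
plt.show()
```

The edges come out as 0, 2, 4, 6, 8, 10, which are the five buckets listed above.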

In [None]:
# We are going to need the pandas library to analyze our data
import pandas as pd
 
# We are going to use the read_csv function from pandas to load our data file
data = pd.read_csv('https://raw.githubusercontent.com/GuillermoFidalgo/Python-for-STEM-Teachers-Workshop/master/data/Dimuon_DoubleMu.csv')
 
# We can look at the first rows of our data by using the DataFrame method head()
data.head()

The invariant mass (M) is an important quantity that allows us to identify what particle we are looking at.

In [None]:
invariant_mass = data['M']
 
invariant_mass

Let's now make a histogram of the invariant mass.

In [None]:
import matplotlib.pyplot as plt
 
plt.hist(invariant_mass)
 
plt.xlabel('Invariant mass [GeV]')
plt.ylabel('Number of events')
plt.title('The histogram of the invariant masses of two muons \n') 
 
plt.show()

In [None]:
# Let's specify the bins
 
plt.hist(invariant_mass, bins=500)
 
plt.xlabel('Invariant mass [GeV]')
plt.ylabel('Number of events')
plt.title('The histogram of the invariant masses of two muons \n') 
 
plt.show()

In [None]:
# Let's improve our plot's resolution
 
plt.figure(dpi=150)
plt.hist(invariant_mass, bins=500)
 
plt.xlabel('Invariant mass [GeV]')
plt.ylabel('Number of events')
plt.title('The histogram of the invariant masses of two muons \n') 
 
plt.show()

In [None]:
# Let's find the best range and bins combination
 
plt.figure(dpi=100)
plt.hist(invariant_mass, bins=200, range=[0,120])
 
plt.xlabel('Invariant mass [GeV]')
plt.ylabel('Number of events')
plt.title('The histogram of the invariant masses of two muons \n') 
 
plt.show()

In [None]:
# Let's put the x and y axes on a log scale
# and give the histogram a "step" look (an outline instead of filled bars)
# I'll also add more bins

plt.figure(dpi=100)
plt.hist(invariant_mass, bins=500, range=[0,120],histtype="step")
 
plt.xlabel('Invariant mass [GeV]')
plt.ylabel('Number of events')
plt.title('The histogram of the invariant masses of two muons \n')
plt.yscale("symlog")
plt.xscale('log')
plt.show()

That doesn't look quite right: the bins all have the same width, so on a log axis they look wide on the left and narrow on the right. Let's space the bins logarithmically with numpy.

In [None]:
import numpy as np
a,b = .5,150
logbins= np.logspace(np.log10(a),np.log10(b),num=500)
plt.figure(dpi=100)
plt.hist(invariant_mass, bins=logbins,histtype="step")
 
plt.xlabel('Invariant mass [GeV]')
plt.ylabel('Number of events')
plt.title('The histogram of the invariant masses of two muons \n')
plt.yscale("log")
plt.xscale('log')
plt.show()
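
To see what **np.logspace** does to the bin edges, here is a minimal sketch that uses the same endpoints as the cell above but only 5 edges, so the output is easy to read (the names `edges` and `ratios` are just for this illustration):

```python
import numpy as np

a, b = 0.5, 150
# 5 edges instead of 500 so the output is easy to read
edges = np.logspace(np.log10(a), np.log10(b), num=5)
print(edges)

# Each edge is the previous one times the same factor,
# which is what makes the bins equally wide on a log axis
ratios = edges[1:] / edges[:-1]
print(ratios)
```

The edges start at `a` and end at `b`, and every value in `ratios` is the same number.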

We can compare our plot with the one provided by the CERN Open Data analysis.

<!-- ![](https://root.cern/doc/master/pict1_df102_NanoAODDimuonAnalysis.C.png) -->

In [None]:
from IPython.display import Image
Image("https://root.cern/doc/master/pict1_df102_NanoAODDimuonAnalysis.C.png",width=600)