# Jupyter Notebook

Notebooks are workspaces where you can write and execute code, take notes, and insert images, videos, etc. They are very useful tools for data analysts to document their analysis process.

# 🐍 Python reminder

## Data types

### Numeric data

In [1]:
# WHOLE NUMBERS AKA INTEGERS
45

45

In [2]:
# Can check type of data 
type(45)

int

In [3]:
# DECIMALS AKA FLOATS
3.14

3.14

In [4]:
type(3.14)

float

### Text

In [5]:
# TEXT AKA STRING
"I'm a scientist"

"I'm a scientist"

In [6]:
type("I'm a scientist")

str

### Booleans

In [7]:
# BINARY TRUE OR FALSE DATA AKA BOOLEANS
1 == 1

True

In [8]:
type(1 == 1)

bool

## Variables

To store data in memory, we use variables.

In [12]:
# You can see variables as labeled boxes in memory
box_name = "content of my box"

In [13]:
# Reuse a variable by calling it by its name
box_name

'content of my box'

In [14]:
# Check type of data inside variable
type(box_name)

str

In [15]:
# Change the content of your variable/box 
box_name = 356.78

In [16]:
# The content and type of data has changed
print(box_name)
type(box_name)

356.78


float

You can store anything in a variable, from basic data types like strings, floats, integers, and booleans, to n-dimensional tables and trained fancy machine-learning models.
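
As a small illustration of storing more than basic data types, the sketch below puts a list and a dictionary into variables. The variable names are made up for this example; the planet name and distance are taken from the dataset used later in this notebook.

```python
# A variable can hold a single value or a whole collection of values
planet_names = ["11 Comae Berenices b", "YZ Ceti b"]                 # a list of strings
first_planet = {"name": "11 Comae Berenices b", "distance": 304.0}   # a dictionary

print(type(planet_names))         # <class 'list'>
print(type(first_planet))         # <class 'dict'>
print(first_planet["distance"])   # 304.0
```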

## Methods & libraries

Python owes much of its power to its open-source community, which develops new "methods" every day for people like you and me to use and save hours of our time. Methods are simply bits of code that have been written by someone else and published online for others to install and reuse. There are methods for (almost) EVERYTHING!

A method is applied to a variable using the following syntax: `name_of_variable.method()`

In [1]:
# Example
"Julie".upper()

'JULIE'

Some methods require parameters:

In [4]:
# Example: "J" here is a parameter passed to .startswith()
"Julie".startswith("J")

True

Methods are often part of **libraries** that need to be imported in our workspace.</br>
You can see **libraries** as tool boxes containing a bunch of **tools**, aka **methods**.

In [None]:
# Syntax for importing the entire library
import library_name

In [None]:
# Syntax for importing a specific method from a library
from library_name import method_name
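
To make the two import styles concrete, here is a minimal sketch using `math`, a library that ships with Python. The choice of `math`, `sqrt`, and `floor` is only for illustration; any installed library follows the same pattern.

```python
# Import the entire library, then reach its tools through the library name
import math
print(math.sqrt(16))  # 4.0

# Import one specific tool, then call it directly by its name
from math import floor
print(floor(3.14))  # 3
```

Importing the whole library keeps it obvious where each tool comes from, while importing a single name keeps the calls shorter.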

# 📊 Data visualization

## Library imports

❓ Start by importing the `pandas` and `plotly.express` libraries. You can give them nicknames, typically `pd` for pandas and `px` for plotly express.

In [9]:
# Insert your code below


## Data import

The CSV file that you'll work on today is located in the `data` folder and named `exoplanets.csv`.</br>
❓Create a variable `file_path` that contains the path to the dataset as a string.

In [10]:
# Insert your code below


❓ Now using the pandas method we saw in class, import your dataset inside the notebook (i.e. "read" your dataset using pandas).</br>
If you don't recall which method to use, check the cheat sheet or google it!
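
If the pattern is hard to recall, the sketch below shows it on a tiny throwaway file rather than on the real dataset. It assumes `pd.read_csv` is the method meant here; `demo_path`, `demo_df`, and the temporary file are hypothetical names used only for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the exoplanet table, written to a temporary CSV file
csv_text = "name,distance\n16 Cygni B b,69.0\nYZ Ceti b,12.0\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = str(Path(tmp_dir) / "demo.csv")   # the path is a plain string
    Path(demo_path).write_text(csv_text)
    demo_df = pd.read_csv(demo_path)              # "read" the file into a dataframe

print(demo_df.shape)  # (2, 2)
```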

In [8]:
# Insert your code below


❓ Store your dataframe inside a variable called `exoplanet_df`.

In [7]:
# Insert your code below


In [8]:
# Test your code
print(type(exoplanet_df) == pd.DataFrame)
print(exoplanet_df.shape == (5250, 13))

True
True


Display your dataframe and take a few minutes to get familiar with it.</br>
Each row represents a different exoplanet, whereas each column represents a characteristic of the exoplanets, such as their name, their distance from Earth in light-years, how they were detected, etc.

In [9]:
# Run this cell
exoplanet_df

Unnamed: 0,name,distance,stellar_magnitude,planet_type,discovery_year,mass_multiplier,mass_wrt,radius_multiplier,radius_wrt,orbital_radius,orbital_period,eccentricity,detection_method
0,11 Comae Berenices b,304.0,4.72307,Gas Giant,2007,19.40000,Jupiter,1.080,Jupiter,1.290000,0.892539,0.23,Radial Velocity
1,11 Ursae Minoris b,409.0,5.01300,Gas Giant,2009,14.74000,Jupiter,1.090,Jupiter,1.530000,1.400000,0.08,Radial Velocity
2,14 Andromedae b,246.0,5.23133,Gas Giant,2008,4.80000,Jupiter,1.150,Jupiter,0.830000,0.508693,0.00,Radial Velocity
3,14 Herculis b,58.0,6.61935,Gas Giant,2002,8.13881,Jupiter,1.120,Jupiter,2.773069,4.800000,0.37,Radial Velocity
4,16 Cygni B b,69.0,6.21500,Gas Giant,1996,1.78000,Jupiter,1.200,Jupiter,1.660000,2.200000,0.68,Radial Velocity
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5245,XO-7 b,764.0,10.52100,Gas Giant,2019,0.70900,Jupiter,1.373,Jupiter,0.044210,0.007940,0.04,Transit
5246,YSES 2 b,357.0,10.88500,Gas Giant,2021,6.30000,Jupiter,1.140,Jupiter,115.000000,1176.500000,0.00,Direct Imaging
5247,YZ Ceti b,12.0,12.07400,Terrestrial,2017,0.70000,Earth,0.913,Earth,0.016340,0.005476,0.06,Radial Velocity
5248,YZ Ceti c,12.0,12.07400,Super Earth,2017,1.14000,Earth,1.050,Earth,0.021560,0.008487,0.00,Radial Velocity


## Explore & visualize your data

❓ How many exoplanets are there in the dataset? Store your answer in a variable called `n_exoplanets`.

In [8]:
# Insert your code below


In [11]:
# Test your code
n_exoplanets == len(exoplanet_df)

True

❓ How many exoplanets of each type have been detected so far?</br>
**Hint:** 2 out of the 3 methods seen in class can work, although 1 is more straightforward

In [9]:
# Insert your code below


❓ You can divide your answer above by the total number of exoplanets detected stored in `n_exoplanets` to get the ratio of each planet type. Try it yourself:

In [10]:
# Insert your code below


❓ Now multiply your answer above by 100 to make it a percentage. You can now get a better idea of which planet types are most commonly detected relative to others.
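
The three steps above (count, ratio, percentage) can be sketched on a four-row toy dataframe. This assumes `.value_counts()` is one of the methods seen in class, which the notebook does not confirm; `demo_df` and its planet names are made up for illustration.

```python
import pandas as pd

# Toy data: four made-up planets, two of them gas giants
demo_df = pd.DataFrame({
    "name": ["Planet A", "Planet B", "Planet C", "Planet D"],
    "planet_type": ["Gas Giant", "Gas Giant", "Terrestrial", "Super Earth"],
})

n_rows = len(demo_df)                                  # total number of rows
type_counts = demo_df["planet_type"].value_counts()    # rows per planet type
type_ratios = type_counts / n_rows                     # share of each type (0 to 1)
type_percentages = type_ratios * 100                   # same share as a percentage

print(type_percentages["Gas Giant"])  # 50.0
```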

In [11]:
# Insert your code below


As you can see, **terrestrial** planets only account for 3.7% of all detections. This is because they are very small and have a dim light that astronomical instruments can't easily detect.

We can also get those numbers with a single line of code using a pie chart. Check the [documentation](https://plotly.com/python-api-reference/generated/plotly.express.pie) if needed.</br>
❓ Using the `.pie` method, you'll need to give at least two parameters, the dataframe to use as `data_frame`, and the column to use as `names`.</br> Complete the code below to display the pie chart:

In [12]:
# Complete the code below
px.pie(data_frame = ???, names = ???)

❓ Let's store your graph inside a variable called `planet_type_pie`.</br>
This will allow you to use the `.update_layout()` method to customize your chart.

In [13]:
# Run the code below
planet_type_pie = px.pie(data_frame = exoplanet_df, names = "planet_type")
planet_type_pie.update_layout(width  = 600,
                              height = 600,
                              title_text = "Types of exoplanets discovered since 1992",
                              title_x = 0.5)

❓ Plotly offers a great variety of customizations, check out the [documentation](https://plotly.com/python/reference/layout/) of `.update_layout()` to play around with some parameters and modify the pie chart to your preferences.

In [14]:
# Insert your code below


Let's now investigate the evolution of detections through time.</br>
❓ Is there any relationship between the **year of discovery** and the **distance** of exoplanets from Earth?</br>
Try answering this question using a [scatter plot](https://plotly.com/python-api-reference/generated/plotly.express.scatter.html?highlight=plotly%20express%20scatter).

In [14]:
# Insert your code below


Plotly's `.scatter()` method allows you to add a **color** dimension to your plot using the parameter `color`.</br>
❓ Using your answer above and the `color` argument, can you tell if there is a relationship between the **year of discovery**, the **distance** of exoplanets from Earth, and the **type of exoplanet**? Check out the [documentation](https://plotly.com/python/line-and-scatter/) if necessary.

In [15]:
# Insert your code below


❓ Now take a couple of minutes to reflect on your scatterplot. Does it tell you anything valuable about exoplanet discoveries?</br>Data visualization is not only about creating nice plots; most importantly, it is about interpreting and communicating the valuable information they display.

**Answer:** You should notice that from 1992 until 2003, the few exoplanets that were discovered were less than 2000 light-years away from Earth. Then in 2004, astronomers were able to detect more planets up to 18k light-years away. Now, they consistently detect hundreds of exoplanets that are more than 20k light-years away.

❓ Nice plot but it can definitely be improved. Use your answer to Q to add a main title to your scatterplot and make other changes to your preferences.

In [16]:
# Insert your code here


Also, the titles of the x and y axes can probably be more explicit.</br>
❓ Use the [`.update_xaxes()`](https://plotly.com/python/reference/layout/xaxis/) and [`.update_yaxes()`](https://plotly.com/python/reference/layout/yaxis/) methods on your scatterplot to change the titles of your axes. Don't forget to include units! Check out the documentation if necessary.

In [19]:
# Insert your code below


In class, we saw how to create new **features** (i.e. new columns) using existing ones.</br>
Specifically, we created a new **feature** called `exoplanet_mass_wrt_earth` using the `mass_multiplier` and `mass_wrt` columns.</br>Run the cell below to create this new feature.

In [20]:
# Run this cell
exoplanet_df["conversion_factor_m"] = exoplanet_df["mass_wrt"].map({"Jupiter": 318, "Earth": 1})
exoplanet_df["exoplanet_mass_wrt_earth"] = exoplanet_df["mass_multiplier"]*exoplanet_df["conversion_factor_m"]
exoplanet_df.drop(columns = ["conversion_factor_m", "mass_multiplier", "mass_wrt"], inplace = True)

In [None]:
# Run this cell to check the new columns were added to your dataframe
exoplanet_df

❓ Based on how `exoplanet_mass_wrt_earth` was computed, create a new **feature** called `exoplanet_radius_wrt_earth`.</br> **Hint:** Jupiter has a radius approx. 11 times greater than Earth's.
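
The `.map()` pattern used for the mass can be tried out on a two-row toy dataframe first. The sketch below uses two radius values copied from the table displayed earlier and the approximate factor of 11 from the hint; `demo_df`, `radius_factor`, and `radius_wrt_earth` are hypothetical names for this example.

```python
import pandas as pd

# Radius values of "11 Comae Berenices b" and "YZ Ceti b" from the table
demo_df = pd.DataFrame({
    "radius_multiplier": [1.080, 0.913],
    "radius_wrt": ["Jupiter", "Earth"],
})

# Translate each reference planet into a conversion factor relative to Earth
radius_factor = demo_df["radius_wrt"].map({"Jupiter": 11, "Earth": 1})

# Multiply the two columns row by row to express every radius in Earth radii
demo_df["radius_wrt_earth"] = demo_df["radius_multiplier"] * radius_factor

print(demo_df["radius_wrt_earth"].round(3).tolist())  # [11.88, 0.913]
```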

In [21]:
# Insert your code below


In [24]:
# Check that two new columns have been added to your dataframe
exoplanet_df

Unnamed: 0,name,distance,stellar_magnitude,planet_type,discovery_year,orbital_radius,orbital_period,eccentricity,detection_method,exoplanet_mass_wrt_earth,exoplanet_radius_wrt_earth
0,11 Comae Berenices b,304.0,4.72307,Gas Giant,2007,1.290000,0.892539,0.23,Radial Velocity,6169.20000,11.880
1,11 Ursae Minoris b,409.0,5.01300,Gas Giant,2009,1.530000,1.400000,0.08,Radial Velocity,4687.32000,11.990
2,14 Andromedae b,246.0,5.23133,Gas Giant,2008,0.830000,0.508693,0.00,Radial Velocity,1526.40000,12.650
3,14 Herculis b,58.0,6.61935,Gas Giant,2002,2.773069,4.800000,0.37,Radial Velocity,2588.14158,12.320
4,16 Cygni B b,69.0,6.21500,Gas Giant,1996,1.660000,2.200000,0.68,Radial Velocity,566.04000,13.200
...,...,...,...,...,...,...,...,...,...,...,...
5245,XO-7 b,764.0,10.52100,Gas Giant,2019,0.044210,0.007940,0.04,Transit,225.46200,15.103
5246,YSES 2 b,357.0,10.88500,Gas Giant,2021,115.000000,1176.500000,0.00,Direct Imaging,2003.40000,12.540
5247,YZ Ceti b,12.0,12.07400,Terrestrial,2017,0.016340,0.005476,0.06,Radial Velocity,0.70000,0.913
5248,YZ Ceti c,12.0,12.07400,Super Earth,2017,0.021560,0.008487,0.00,Radial Velocity,1.14000,1.050


Now that you have both the mass and the radius of each exoplanet with respect to Earth, you can compute the **density**.</br>
❓ Create a new `density` **feature** in your dataframe, where</br>

$$\text{density} = \frac{\text{mass}}{\text{volume}},\hspace{5mm}\text{volume} = \frac{4}{3}\times\pi\times R^3$$
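
To check the formula by hand, here is a worked example for a single planet using plain Python. It takes the mass and radius of "11 Comae Berenices b" as displayed above; the variable names are made up for this sketch, and the result is expressed in Earth masses per cubic Earth radius.

```python
import math

# Values for "11 Comae Berenices b" as shown in the dataframe above
mass_wrt_earth = 6169.2     # in Earth masses
radius_wrt_earth = 11.88    # in Earth radii

# Volume of a sphere, then density as mass divided by volume
volume = 4 / 3 * math.pi * radius_wrt_earth ** 3
density = mass_wrt_earth / volume

print(round(density, 2))  # 0.88
```

The same arithmetic works on whole dataframe columns, since pandas applies `*`, `/`, and `**` row by row.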

In [25]:
# Insert your code below


In [26]:
# Check that your new feature has been added to your dataframe
exoplanet_df

Unnamed: 0,name,distance,stellar_magnitude,planet_type,discovery_year,orbital_radius,orbital_period,eccentricity,detection_method,exoplanet_mass_wrt_earth,exoplanet_radius_wrt_earth,density
0,11 Comae Berenices b,304.0,4.72307,Gas Giant,2007,1.290000,0.892539,0.23,Radial Velocity,6169.20000,11.880,0.878843
1,11 Ursae Minoris b,409.0,5.01300,Gas Giant,2009,1.530000,1.400000,0.08,Radial Velocity,4687.32000,11.990,0.649529
2,14 Andromedae b,246.0,5.23133,Gas Giant,2008,0.830000,0.508693,0.00,Radial Velocity,1526.40000,12.650,0.180106
3,14 Herculis b,58.0,6.61935,Gas Giant,2002,2.773069,4.800000,0.37,Radial Velocity,2588.14158,12.320,0.330588
4,16 Cygni B b,69.0,6.21500,Gas Giant,1996,1.660000,2.200000,0.68,Radial Velocity,566.04000,13.200,0.058784
...,...,...,...,...,...,...,...,...,...,...,...,...
5245,XO-7 b,764.0,10.52100,Gas Giant,2019,0.044210,0.007940,0.04,Transit,225.46200,15.103,0.015632
5246,YSES 2 b,357.0,10.88500,Gas Giant,2021,115.000000,1176.500000,0.00,Direct Imaging,2003.40000,12.540,0.242665
5247,YZ Ceti b,12.0,12.07400,Terrestrial,2017,0.016340,0.005476,0.06,Radial Velocity,0.70000,0.913,0.219694
5248,YZ Ceti c,12.0,12.07400,Super Earth,2017,0.021560,0.008487,0.00,Radial Velocity,1.14000,1.050,0.235217


❓ Now that you added a few more **features** to your data, let's see if they're correlated with each other.</br>
To do this you can first create a correlation matrix with pandas' `.corr()` method.</br>
Then display your matrix as a heatmap using Plotly's `.imshow()` method.

In [18]:
# Insert your code below


❓ Let's improve your heatmap. Check out Plotly's [documentation](https://plotly.com/python/heatmaps/) and add the correlation coefficients to the heatmap's cells.

In [19]:
# Insert your code below


❓ The correlation coefficients are now displayed on the plot, but they have way too many decimals.</br>Modify your `matrix` using the `.round()` method to only keep 2 decimals. Check out the [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.round.html) documentation if needed.
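
Before working on the full dataset, it can help to see `.corr()` and `.round()` on numbers whose relationships are obvious. The toy dataframe below is made up for illustration: `y` grows with `x`, `z` shrinks as `x` grows, and the text column is left out because correlations only make sense for numeric columns.

```python
import pandas as pd

# Made-up data: y is exactly 2 * x, z decreases as x increases
demo_df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
    "label": ["a", "b", "c", "d"],
})

# Keep numeric columns only, then compute the pairwise correlation matrix
demo_matrix = demo_df.select_dtypes("number").corr()

# Keep two decimals so the values stay readable on a heatmap
rounded_matrix = demo_matrix.round(2)

print(rounded_matrix.loc["x", "y"])  # 1.0
print(rounded_matrix.loc["x", "z"])  # -1.0
```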

In [32]:
# Insert your code below


❓ Now plot your rounded matrix again using `.imshow()`.

In [20]:
# Insert your code below


❓ You can also change the color scheme of your heatmap using the `color_continuous_scale` parameter of `.imshow()`.</br>
Check out Plotly's [documentation](https://plotly.com/python/colorscales/#color-scales-in-plotly-express) if necessary. Also, [here](https://plotly.com/python/builtin-colorscales/) you'll find all of Plotly's built-in color scales.

In [21]:
# Insert your code below


❓ Finally, by storing your heatmap inside a new variable, say `correlation_map`, you can use the `.update_layout()` method to add a main title, fix the map's width and height, and make any other styling modification you want.

In [None]:
# Insert your code below


Now that you have a pretty-looking correlation map, take a couple of minutes to reflect on it.</br>
❓ Which features are positively correlated?</br>
❓ Are there any features that are negatively correlated?

**Answer:**
1. The strongest positive correlation exists between the `orbital_radius` and the `orbital_period` of an exoplanet. This is because the two are directly related mathematically. The `orbital_radius` of an exoplanet combined with its orbital speed, which you could easily compute and add to your dataframe, determines the `orbital_period`.</br>
2. More interestingly, the `stellar_magnitude` of an exoplanet is positively correlated with its `distance` from Earth. The `stellar_magnitude` represents the brightness of an exoplanet. Most astronomical observations are made by instruments on Earth or in Earth's orbit. Therefore, the farther away the exoplanet is, the less bright it appears to our telescopes: the greater the `distance` between an exoplanet and Earth, the lower its apparent brightness. So why do we observe a **positive** rather than a **negative** correlation between `distance` and `stellar_magnitude`? Because, by convention, the brighter the exoplanet is, the lower the value of its `stellar_magnitude`.</br>
3. Finally, from the heatmap we can also observe a rather strong **negative** correlation between an exoplanet's `eccentricity` and its `stellar_magnitude`.

❓ Let's now investigate the distribution of densities per planet type.</br>
A great way to visualize a distribution is to use a [box plot](https://plotly.com/python/box-plots/).

In [26]:
# Insert your code below


❓ You might want to convert the y-axis into a **log scale**. Using the detailed [documentation](https://plotly.com/python-api-reference/generated/plotly.express.box), find the method parameter that allows you to transform the y-axis into a log scale.

In [None]:
# Insert your code below


❓ Store your box plot inside a new variable and use the `.update_layout()` method to style it to your preferences.

In [None]:
# Insert your code below


Take a moment to reflect on your box plot. What can you tell about the densities of each planet type?</br>
Why are **terrestrial** planets labelled **terrestrial**?

For the final part of this notebook, let's have a look at the detection methods.</br>
❓ Using the `.groupby()` method, aggregate `exoplanet_df` so as to get the total number of exoplanets detected for each `detection_method`. Think about the aggregator function to use in combination with `.groupby()`.</br>
Check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) if necessary.

In [5]:
# Insert your code below


❓ Combined with an aggregator, the `.groupby()` method outputs a new dataframe where the column you passed to `.groupby()` is now the row index, which is not great. Check out the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) again and modify your code above to prevent `detection_method` from becoming the row index. See how `detection_method` now remains a column of the resulting dataframe.
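
The difference is easier to see side by side on a toy dataframe. The sketch below assumes `.count()` as the aggregator and `as_index=False` as the relevant `.groupby()` parameter; the five rows and the variable names are made up for illustration.

```python
import pandas as pd

# Five made-up planets with their type and detection method
demo_df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "planet_type": ["Gas Giant", "Gas Giant", "Terrestrial", "Gas Giant", "Super Earth"],
    "detection_method": ["Transit", "Transit", "Transit",
                         "Radial Velocity", "Radial Velocity"],
})

# Default behavior: the grouping column becomes the row index
indexed_counts = demo_df.groupby("detection_method").count()
print("detection_method" in indexed_counts.columns)  # False

# With as_index=False the grouping column stays a regular column
flat_counts = demo_df.groupby("detection_method", as_index=False).count()
print(list(flat_counts.columns))  # ['detection_method', 'name', 'planet_type']
```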

In [None]:
# Insert your code below


Pandas' `.groupby()` is very powerful and offers a wide variety of possibilities to aggregate your dataframe. In fact, you can aggregate your data on **two** or more columns.</br>
❓ Modify your `.groupby()` code above so as to aggregate `exoplanet_df` on `detection_method` and `planet_type`.

In [25]:
# Insert your code below


Also, the `.groupby()` method returns a new dataframe with all the original columns, whereas you only need to keep one of the aggregated columns.</br>
❓ From your double aggregation above, retrieve only the `detection_method`, `planet_type`, and `name` columns.
Store your dataframe inside a new variable called `method_per_planet_type`.
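
Grouping on two columns and keeping only a few of the resulting columns can be sketched as follows. The toy data and the names `demo_df`, `demo_grouped`, and `demo_selection` are hypothetical, and `.count()` is assumed as the aggregator.

```python
import pandas as pd

# Five made-up planets, with an extra numeric column that will be dropped
demo_df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "distance": [10.0, 20.0, 30.0, 40.0, 50.0],
    "planet_type": ["Gas Giant", "Gas Giant", "Terrestrial", "Gas Giant", "Super Earth"],
    "detection_method": ["Transit", "Transit", "Transit",
                         "Radial Velocity", "Radial Velocity"],
})

# Pass a list of columns to group on several columns at once
demo_grouped = demo_df.groupby(["detection_method", "planet_type"],
                               as_index=False).count()

# Double square brackets select several columns and return a dataframe
demo_selection = demo_grouped[["detection_method", "planet_type", "name"]]

print(demo_selection.shape)  # (4, 3)
```

After counting, the `name` column holds the number of planets in each (detection method, planet type) pair, so it is the only aggregated column worth keeping.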

In [24]:
# Insert your code below


In [23]:
# Test your code
print(method_per_planet_type.shape == (28,3))
list(method_per_planet_type.columns) == ["detection_method", "planet_type", "name"]

❓ Finally, let's now plot a `bar` chart illustrating the total number of exoplanets discovered by each detection method and for each planet type.</br>Check out Plotly's [documentation](https://plotly.com/python/bar-charts/#bar-chart-with-plotly-express) if necessary. You might want to convert your y-axis into a log scale.

In [22]:
# Insert your code below


❓ Store your bar chart inside a new variable and use the `.update_layout()` method to style your graph to your preferences.

In [None]:
# Insert your code below


# 🎉 Congratulations on finishing this notebook!

You now know how to read, manipulate, and visualize CSV files with `pandas` and `plotly express`!