# It's Day 1 of the workshop and I know some things!

Welcome to day 1 of the bootcamp! Now that we've covered the basics, we can start diving into what makes programming useful and powerful. To begin, let's quickly review what we learned in the pre-workshop. *Note: You'll need to run all code cells of this tutorial in order.*

### Variable Declarations
We learned that to store information (of any kind) in Python, we want to set a variable name equal to that information, and then use that name to perform calculations on it.

### Data Types
We learned that Python has different rules for different kinds of data — it performs calculations differently on integers than on floats, treats lists differently than numpy arrays, etc. Figuring out what data type is the most efficient and effective way to work with your data is one of the key conceptual skills to learn when programming. 

### Lists and Indexing
We learned that the "default" way to store simple data (say, a bunch of numbers) is in a **list**, which can then be **indexed** by element number (starting with zero) to extract values from the list. We learned that lists can be fed into certain functions, like sum(), to return the sum of all numbers in the list (assuming the list is, indeed, all numbers). 

### Debugging (barely)
You probably didn't notice, but we practiced a little bit of debugging as well — we printed out lists to make sure they were filled with the numbers we wanted after a calculation, a simple form of debugging! 

## What we will cover here
By the end of this tutorial, we hope you will be able to handle the first task a professor might give you when starting to do research with them — loading up some data from a simple ASCII file, performing some calculations, and plotting it. To do all this, we will need to learn a bit of the **Numpy and Pandas Libraries**, some **conditional statements** and **loops**, and some new **plotting** techniques. We will also introduce the concept of **functions** here that is important for advanced Python programming. These concepts are covered in Chapters 3 - 7 of the [textbook](https://prappleizer.github.io/index.html). You can read them if you have time (the chapters are fairly short), but they cover more than what you will be needing in this tutorial.

Without further ado, let's jump in! 

### Numpy, Scipy, Matplotlib, and Beyond
In the pre-workshop tutorial, we had to resort to calling a special data type that was not native to Python: the Numpy Array. This was useful to us because of a special behavior: performing a math operation on an array applies it to each value in the array, which is useful for, say, subtracting the mean from every value and then squaring every value. 

But what *is* Numpy, actually? 

Believe it or not, from a mathematical perspective, what you saw in the previous tutorial was just about the limit of Python's native math functionality. You can add, subtract, multiply, exponentiate, and take modulos. To do anything more complicated (like, say, calculating a sine or cosine), we need to actually **import** libraries of functions which can accomplish these tasks. 

#### What's a function? 
It's useful to take a second to make sure we're on the same page about functions. A function is something that takes one or more inputs, and spits something out. When, in math class, you write y = sin(x), "sin" is the function you are using. The "x" you are plugging in is the *argument* of that function, and you are storing its *output* in the variable "y". If we use the range() function to create the sequence of integers from 1 to 10, 

In [None]:
y = range(1, 11)
print(y)

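Then "1" and "11" are arguments to the range function, and its output is stored in "y". (In Python 3, range() returns a lazy range object rather than a list, so the print shows `range(1, 11)` instead of the individual numbers.) Note that print() is also a function: it takes in the argument "y" and prints its value to the screen. 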
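If you want to see the actual numbers that range() stands for, one option is to convert it to a list. The short sketch below only uses built-in Python; the variable name "values" is just for illustration.

```python
# range() describes a sequence without storing every value up front
y = range(1, 11)
print(y)            # prints the range object itself

# Converting to a list materialises the individual integers
values = list(y)
print(values)
print(len(y))       # a range still knows how many items it holds
```

Note that the stop value (11) is not included, which is why we ask for 11 to end at 10.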

Back to the task at hand. If we want to calculate the sine of a number, $x$, we can't do that in native Python. But luckily, many clever people have crafted libraries of functions which can. 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Okay, so to use these libraries of functions, we have to **import** them into our code, as we have above. Notice that we could just "import numpy" as well, but Python lets you give the library a "nickname" so that in your code, you don't have to type out "numpy" every time. You can choose whatever nickname you want, but by convention numpy is imported as np and matplotlib.pyplot (a subset of matplotlib with the plotting commands we'll be using) as plt. Don't worry about the "%matplotlib inline" line: it just makes plots appear in this notebook rather than in a separate window.

Now, we can create our sine:

In [None]:
x = np.linspace(0, 10, 100)
y = np.sin(x)

Whoa! New function alert! We also just used np.linspace(). Unlike range(), which has you pick a start and a stop and advances in integer steps in between, np.linspace() lets you pick a start, a stop, and a number of points, and spaces them evenly (both endpoints are included by default). Read line one as "give me an array from 0 to 10 with 100 evenly spaced points." Read line two as "give me an array that contains the sine of each value in the x array." 
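To make the "number of points" versus "step size" distinction concrete, here is a small sketch. It also uses np.arange(), which is not otherwise covered in this tutorial; it is Numpy's array version of range() and accepts non-integer steps.

```python
import numpy as np

# linspace: you choose HOW MANY points; both endpoints are included
by_count = np.linspace(0, 10, 5)
print(by_count)

# arange: you choose the STEP; the stop value itself is excluded
by_step = np.arange(0, 10, 2.5)
print(by_step)
```

Both calls cover the same interval, but only linspace lands exactly on the stop value.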

Now let's whip out our plotting:

In [None]:
plt.plot(x,y)

The above is the absolute barebones you can plot: $x$ against $y$. We've used fancier plotting techniques before and will get into them a bit later. 

## Loading Data from a File
It is day 1 of your new research assignment. You've just met with the professor or post-doc. They've probably given you like 10,000 papers to read (skim). They might also have given you a file or two of data, and told you to "familiarize" yourself with the format, get it into Python, and make some plots. 

This is what we are going to learn to do now.

### Loadtxt() and Genfromtxt()
Astronomical data are stored in a huge variety of file formats and organizational schemes. Let's start with the simplest and build up. In Ye Olde Days, basically all data were kept in plaintext ASCII files (in short, text). Things have changed recently: although data tables are often still the most efficient means of storage, they are now wrapped inside file formats like FITS and HDF5 to make them more portable and stable over time. At the end of the day, we are most interested in getting past those layers of protocols to the raw numbers underneath, which we want sitting around in arrays we can mess with. 

We started with the simplest of cases in the previous tutorial: the ASCII text file. We will be going over FITS files on day 2 of the workshop, but for now let's stick to text files. You should have access to the "cumulative_2022.04.26_22.11.54.csv" file in the "pasea2022/PythonML_workshop/Day_1/data" folder on your Google Drive if you uploaded the "pasea2022" GitHub repository to your Drive (we will help you if you haven't). This file contains Kepler data that we will investigate to understand exoplanets!

Your first task: use the cell below to load the data using Numpy's loadtxt() function into a variable called "kepler_array".

The file you are loading contains several "ID" and "name" columns: the first is the Kepler ID, the second is the Kepler object of interest (KOI) name, and the third is the Kepler name. So, make sure you use the right "dtype" option as you did in the last tutorial.

In [None]:
# Your code here (remember to mount the Google Drive)

In the cell below, once you get it to load without throwing an error, print the array to see what it looks like. 

In [None]:
kepler_array

Notice how none of the lines starting with a "#" in the "cumulative_2022.04.26_22.11.54.csv" file are present in "kepler_array". This is because np.loadtxt() removes any "#" lines by default as they represent comments. You can explicitly specify what symbol represents comments using the "comments" keyword.
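To see the comment-skipping behavior in isolation, here is a minimal sketch that reads from an in-memory stand-in for a file. The four rows of text are made up for illustration (they are not from the Kepler table), and "fake_file" and "demo_array" are hypothetical names.

```python
import io
import numpy as np

# A made-up, in-memory stand-in for a small CSV file with a comment line
fake_file = io.StringIO(
    "# this line is a comment and will be skipped\n"
    "kepid,kepoi_name,kepler_name,koi_disposition\n"
    "1001,K00001.01,Demo-1 b,CONFIRMED\n"
    "1002,K00002.01,,CANDIDATE\n"
)

# dtype=str keeps every entry as text; comments="#" is the default,
# written out here only to make the behavior explicit
demo_array = np.loadtxt(fake_file, delimiter=",", dtype=str, comments="#")
print(demo_array.shape)
print(demo_array[0])   # the header row survives; the "#" line does not
```

Because everything is read as a string, numeric columns would still need converting before doing math on them.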

There are other ways to load text files. You can use the "np.genfromtxt()" function, which is slightly more powerful than np.loadtxt(). You can look up the documentation of np.genfromtxt() to see this for yourself. There is also the **pandas** library for data analysis and manipulation, which we recommend you use for loading csv files! Let's import this library:

In [None]:
import pandas as pd

Now, we can read the kepler file using the "pd.read_csv()" function as follows:

In [None]:
# Use the right path to the "cumulative_2022.06.13_17.10.09.csv" file
# The path will change depending on how you downloaded and uploaded the file(s)
kepler_df = pd.read_csv('drive/My Drive/pasea2022/PythonML_workshop/Day_1/data/cumulative_2022.06.13_17.10.09.csv', comment="#")

Now print "kepler_df"

In [None]:
kepler_df

That looks like a table! You have 9564 rows × 48 columns. The rows are indexed from 0 to 9563 and the columns are indexed by their names.

Let's check the datatype of "kepler_df"

In [None]:
# Your code here

We get what is called a pandas "DataFrame". 

You can index a column in a DataFrame using its name (similar to how you did indexing with dictionaries)

In [None]:
kepler_df['kepid']

You see that the data type of all the values in this column is "int64". Thus, pd.read_csv() was able to infer the data type on its own. 

To access the row with index label 2 (the third row, since counting starts at zero), you use the following code:

In [None]:
kepler_df.loc[2]

Here we see a few entries that say **NaN**. pd.read_csv() automatically converts all missing values to "NaN", which stands for "not a number". It is a special floating-point value that represents undefined results (e.g., 0/0). Let's explicitly check the data type of one such value:

In [None]:
type(kepler_df.loc[2]['kepler_name'])

The value is numeric! This is good because we can apply numeric operations to numeric arrays despite having missing (NaN) values. More on this will be covered later in the workshop.
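As a small preview, the sketch below shows how NaN behaves in arithmetic. The three "periods" are invented values; np.nanmean() is a standard Numpy function that skips NaN entries.

```python
import numpy as np

# Made-up values with one missing entry
periods = np.array([1.5, np.nan, 4.5])

print(type(np.nan))         # NaN is a float
print(periods * 2)          # element-wise math still runs; NaN stays NaN
print(np.mean(periods))     # an ordinary mean is "poisoned" by the NaN
print(np.nanmean(periods))  # nanmean ignores the missing entry
```

So NaN never crashes a calculation, but it can quietly propagate into your results unless you handle it.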

Now let's access the same row as above using the "kepler_array" ndarray:

In [None]:
# Row 3 because the header forms Row 0
kepler_array[3]

You see that the same entries that are marked "NaN" in the DataFrame are empty strings ('') in the ndarray.

So now we are starting to see a methodology for extracting the numbers out of the strings. Our next step is going to be searching for multiple observations of a Kepler target (a star) in the Kepler data and analyzing them to understand the target and its exoplanets. We are going to take a detailed look into pandas DataFrames and **for-loops**.

For indexing all observations of a particular target, we can use "kepid" as the index. This is shown below

In [None]:
kepler_df = kepler_df.set_index('kepid')

In [None]:
kepler_df

We can now index using the first Kepler ID in the list.

In [None]:
kepler_10872983 = kepler_df.loc[10872983]

In [None]:
kepler_10872983

Pandas DataFrames have a "describe" function that computes summary (descriptive) statistics as follows:

In [None]:
kepler_df.describe()

### For-Loops
There are two primary looping methods in Python: For-loops and While-loops. We'll focus on For-loops for a second. 

A For-loop allows you to specify what's known as an iterator (usually an increasing array of indices), which lets you run a block of code over and over again under slightly different circumstances. For example, what if we wanted to advance through the observations of the Kepler target 10872983 and, on a new line each time, print the name of the exoplanet detection? We could do that with the following:

In [None]:
for i in kepler_10872983['kepler_name']:
    print(i)


OK, so what just happened? By saying `for i in kepler_10872983['kepler_name']`, we were telling the computer that `kepler_10872983['kepler_name']` was a container with multiple "things" in it (the entries we saw above). We told it "Hey, for every *thing* in `kepler_10872983['kepler_name']`, print out that *thing*."

Notice that this worked because we, the programmers, knew that `kepler_10872983['kepler_name']` was something that could be advanced through. 

Let's see another example. Remember the range() function? We can use that as an iterator as well:

In [None]:
for i in range(10):
    print('I am analyzing data unit: {}'.format(i))

Remember, range(10) can be treated as [0,1,2,3,4,5,6,7,8,9]. You could, to see it more clearly, say:

In [None]:
thing_to_loop_over = range(10)
for thing in thing_to_loop_over:
    print('I am analyzing data unit: {}'.format(thing))

I'm also highlighting here that while "i" is a standard choice for an outer loop iterating variable name (followed by "j" and "k"), you can use whatever you want as long as it's consistent in the loop. 

We can also use loops to fill an empty list with values, e.g.

In [None]:
to_fill = []
for i in kepler_10872983['kepler_name']:
    to_fill.append(i[-1])
print(to_fill)

What we've done here is make a list containing all the exoplanet identifiers, as strings! For each item in `kepler_10872983['kepler_name']`, we first take the last character of the name (the i[-1] part), then we **append** that value to the initially empty "to_fill" list defined outside the loop. Appending is easy, as shown, because it is a *method* of lists. So to add anything to the end of a list, you just write list_name.append(thing_to_add). 
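One caveat: indexing with [-1] only works on strings. If a target has entries without a confirmed name, those values are NaN (a float), and i[-1] would raise a TypeError. A possible guard is sketched below on a made-up list ("names" and "letters" are hypothetical); it uses an if-statement, which is covered later in this tutorial.

```python
import numpy as np

# Hypothetical column values: two names and one missing entry
names = ["Demo-1 b", np.nan, "Demo-1 c"]

letters = []
for name in names:
    # Only strings can be indexed; skip missing (NaN) entries
    if isinstance(name, str):
        letters.append(name[-1])
print(letters)
```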

### Exercise 1: A dictionary of Kepler IDs / exoplanets
OK, it's time for you to dive in! Once you've gotten the hang of the above, and maybe played around a bit with it, try the following. 

In the cell block below, define an empty dictionary called "kepler_planets". Then, initialize a for-loop that goes through the "kepler_array" ndarray, and puts each Kepler ID (as a string or an integer) as a key, and the total number of confirmed exoplanets orbiting it (as a float) as a value. 

You can set new values in a dictionary even more easily than appending. Simply use 

dictionary_name['new key'] = new value

Hint: The 'koi_disposition' column tells us whether an exoplanet is 'CONFIRMED', a 'CANDIDATE', or a 'FALSE POSITIVE', and the 'kepler_name' column mentions the name of confirmed exoplanets.

*Note: Here all our keys are strings, and this is often the use-case for dictionaries, but it is not required. You can make dictionaries whose keys are, for example, integers.*

In [None]:
kepler_planets = {}

# Skip the header row (column names)
kepler_arr_dat = kepler_array[1:]

# Choose all the planets with `koi_disposition = CONFIRMED`
kepler_arr_dat_conf = kepler_arr_dat[kepler_arr_dat[:, 3] == 'CONFIRMED']

# Your code here

Did it work? Try indexing for the Kepler ID '10872983' in the cell below:

In [None]:
kepler_planets['10872983']

Did you get 3? Or are you struggling to create the dictionary?

Do not worry if you weren't able to write the above code. The reason we asked you to try writing it is to show that it is not straightforward to perform the above operation using a Numpy ndarray, but relatively easy using a Pandas DataFrame! Try the following code:

In [None]:
# Group entries in `kepler_df` by `kepid`
kepler_df_groups = kepler_df.groupby('kepid')

# Find the number of unique names of confirmed exoplanets
kepler_planets = kepler_df_groups['kepler_name'].nunique()

# Print examples
print(kepler_planets[10872983], kepler_planets[10797460], kepler_planets[10854555], kepler_planets[10811496])

In [None]:
type(kepler_planets)

Here "kepler_planets" is indexed by an integer that represents the Kepler ID (rather than a string). Its datatype is "pandas.core.series.Series", which works similarly to a dictionary. You can convert the Series to a dictionary using `kepler_planets.to_dict()`.
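To see what groupby() and nunique() do without the full table, here is a toy sketch. The IDs and names are invented, and "toy_df", "toy_counts", and "toy_dict" are hypothetical names.

```python
import numpy as np
import pandas as pd

# A made-up table: target 1001 has two named planets and one unnamed entry,
# target 1002 has no named planets at all
toy_df = pd.DataFrame({
    "kepid": [1001, 1001, 1001, 1002],
    "kepler_name": ["Demo-1 b", "Demo-1 c", np.nan, np.nan],
})

# Group rows by target, then count distinct names (NaN is ignored by default)
toy_counts = toy_df.groupby("kepid")["kepler_name"].nunique()
print(toy_counts)

# A Series converts to a plain dictionary keyed by the index values
toy_dict = toy_counts.to_dict()
print(toy_dict)
```

The fact that nunique() skips NaN by default is what makes it count only the entries that have a name.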

To see if the new "kepler_planets" Series is doing the right thing, index "kepler_df" for the Kepler IDs 10872983 and 10811496 (as integers, since "kepid" is now an integer index):

In [None]:
# Your code here

In [None]:
# Your code here

Looks like "kepler_planets" contains the right information! 

Now let's find the Kepler ID with the largest number of confirmed exoplanets.

In [None]:
kepler_planets.max()

In [None]:
kepler_df.loc[kepler_planets.idxmax()]

We can also find the mean number of exoplanets.

In [None]:
kepler_planets.mean()

Looks like the mean is 0.32. How do we know if a Kepler ID has more or fewer confirmed exoplanets than the mean? That's, naturally, where conditional "if-statements" come in. 

## If Statements and other Conditionals
The problem I've posed, of figuring out whether a condition is true or not, is addressed in code via conditional statements. They run, logically, along the lines of "IF something is TRUE, do THIS, IF something ELSE is TRUE, do THAT, OTHERWISE do SOMETHING ELSE." 

Here's an example:

In [None]:
fig, ax = plt.subplots(1, 1)

for ind, row in kepler_df.iterrows():
    if row['koi_disposition'] == 'CONFIRMED':
        ax.scatter(row['koi_period'], row['koi_prad'], color='blue', s=2)

# We will go over the plotting details later
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlim(10**(-1), 10**(4))
ax.set_ylim(10**(-1), 10**(6))
ax.set_xlabel('Period [days]')
ax.set_ylabel('Radius [Earth Radii]')
ax.set_title('Confirmed Planets')

We can also link several if-statements using elif statements, which are combinations of else and if:

In [None]:
fig, ax = plt.subplots(1, 1)

for ind, row in kepler_df.iterrows(): 
    if row['koi_disposition'] == 'CANDIDATE':
        ax.scatter(row['koi_period'], row['koi_prad'], color='red', s=2, alpha=0.5)
    # Your code here: use elif statement here to plot 'CONFIRMED' Kepler planets

ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlim(10**(-1), 10**(4))
ax.set_ylim(10**(-1), 10**(6))
ax.set_xlabel('Period [days]')
ax.set_ylabel('Radius [Earth Radii]')
ax.set_title('Candidate + Confirmed Planets')

What happened here? We said "IF the Kepler object of interest is a planet CANDIDATE, plot its radius vs period in red, ELSE, IF it is a CONFIRMED planet, plot its radius vs period in blue, ELSE do nothing (`pass`)". If the first condition is met, the elif (or else) is not triggered. Thus, by mixing together if's, else's, and elif's, you can check conditions you are interested in.
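The same if / elif / else chain can be tried without any plotting. The sketch below walks through three disposition strings and records which branch fired; the list "labels" is a hypothetical name used only here.

```python
labels = []
for disposition in ['CANDIDATE', 'CONFIRMED', 'FALSE POSITIVE']:
    if disposition == 'CANDIDATE':
        labels.append('red')       # first condition met; elif/else are skipped
    elif disposition == 'CONFIRMED':
        labels.append('blue')      # only checked if the first condition failed
    else:
        labels.append('skipped')   # everything else ends up here
print(labels)
```

Only one branch runs per pass through the loop, which is why the order of the conditions can matter.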

Notice that the above looks very similar to [this pre-generated plot](https://exoplanetarchive.ipac.caltech.edu/exoplanetplots/exokepler_all_radperiod.png) on the NASA Exoplanet Archive! You are on your way to making publication-quality plots. 

***

#### Q - What scientific result does the above plot communicate?

***

Ans.

Now, let's look at a simple example to better understand conditionals:

In [None]:
person_name = 'Finnaeus Fthrockbottom'
if len(person_name) < 10:
    print('This is a short name')
else: 
    print('This is a long name')

In [None]:
if ('F' in person_name) and ('n' in person_name):
    print('Both present')

if ('F' in person_name) or ('n' in person_name):
    print('One is present')

if ('l' not in person_name) and ('y' not in person_name):
    print('Neither present')

if ('l' not in person_name) or ('F' not in person_name):
    print('One is not present')

Take a moment parsing the above, seeing how you can string together conditionals. You can ask if things are in, or not in, other things, or you can compare values by asking if things are equal (==), not equal (!=), or greater than/less than (>, <). 

### Exercise: Sort the entries
Below, sort the indices (Kepler IDs) of the "kepler_df" DataFrame. You can google how to do that!

In [None]:
# Your code here

Try the same with the "kepler_array" ndarray.

Hint: Remember to typecast Kepler ID to integers.

In [None]:
# Your code here

So we've now seen how to get data from a text file into Python, where we can start interrogating it, and performing analysis and calculations with it. For a look at how to pull in data from more complicated systems, like FITS, check out the next tutorial.

Now, the other thing Python is great for is powerful visualizations. We saw in the pre-workshop tutorial how we could plot the histogram of scores and get an idea of where one standard deviation on either side of the mean was. We also plotted the Radius - Period distribution of Kepler planets above. Now we are going to do a bit more plotting with the Kepler dataset.

### Plotting a histogram
Let's jump back, to start, to that histogram from the pre-workshop, and see line by line how to make it. Don't feel discouraged if plotting commands seem like a whole new language on top of Python; they kind of are. It takes a lot of practice and experience to build up familiarity with what commands make plots look certain ways. For now, googling "how to add <insert> to a plot" is fine. 

The simplest kind of plot possible is plt.plot(), as for the sine wave we plotted above. But that doesn't really help us with 1D data (e.g., looking at the periods of all the Kepler exoplanets). We need to extend into a new dimension to plot anything, which is why seeing how many planets have a particular period is an interesting metric. That's a histogram. Matplotlib has a built-in function to plot these. We can start by specifying nothing but the values to histogram (`kepler_df['koi_period']`):

In [None]:
plt.hist(kepler_df['koi_period'])
plt.show()

The above plot doesn't look very useful. That is because the x-axis (period) range is too large (from 0 to 120,000 days!). Let's choose a smaller range that is more realistic using the "range" parameter.

In [None]:
plt.hist(kepler_df['koi_period'], range=[0, 650])
plt.show()

That looks better, but the y-axis value (frequency or number of planets) of each bin is either too high or too low. Let's use a logarithmic scale for it:

In [None]:
plt.hist(kepler_df['koi_period'], range=[0, 650])
plt.yscale('log')
plt.show()

That looks good!

By default, matplotlib picks a color, and creates ticks and labels as shown. What if we want the blue to actually be red? And semi transparent? We can do that:

In [None]:
plt.hist(kepler_df['koi_period'], range=[0, 650], color='r', alpha=0.5)
plt.yscale('log')
plt.show()

It's hard to see because there's nothing behind it, but we now have a semi-transparent red plot. Now, looking closely, we have by default created 10 bins. We can create more or fewer:

In [None]:
plt.hist(kepler_df['koi_period'], bins=20, range=[0, 650])
plt.hist(kepler_df['koi_period'], bins=5, range=[0, 650], color='r', alpha=0.5)
plt.yscale('log')
plt.show()

We can see that by decreasing the number of bins, more items (Kepler objects) appear in each bin. By increasing the number of bins, the opposite effect occurs. Often, we want to normalize histograms:

In [None]:
plt.hist(kepler_df['koi_period'], bins=20, range=[0, 650], density=True)
plt.hist(kepler_df['koi_period'], bins=5, range=[0, 650], color='r', alpha=0.5, density=True)
plt.yscale('log')
plt.show()

By normalizing, we can see the distributions overlaid on each other. It seems like the default of 10 was a decent number of bins. Let's try 9, un-normalize, and then add some labels:  

In [None]:
plt.hist(kepler_df['koi_period'], bins=9, range=[0, 650], color='r', alpha=0.5)
plt.yscale('log')
plt.xlabel('Period [days]')
plt.ylabel('Number of Planets')
plt.title('Distribution of Kepler Planet Periods')
plt.show()

Often we are interested in knowing the percentiles of a distribution — the standard in statistics is the 16th, 50th, and 84th percentile. You can calculate these easily with `np.percentile(array_like, #)`, where # is the percentile you want to calculate. 
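As a quick illustration on made-up numbers (not the Kepler periods), np.percentile() also accepts a list of percentiles and returns one value per entry. The names "toy_values", "p16", "p50", and "p84" are hypothetical.

```python
import numpy as np

toy_values = np.array([1, 2, 3, 4, 5])

# Ask for several percentiles at once; the result has one value per request
p16, p50, p84 = np.percentile(toy_values, [16, 50, 84])
print(p16, p50, p84)
```

The 50th percentile is the median; the other two are interpolated between neighboring data points when they do not land exactly on one.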

I'll tell you that the function `plt.axvline(value, ls='--', color='k')` will plot a vertical black dashed line at a certain x-axis value "value". In the block below, reproduce the plot in the cell above, but with the standard percentile spots demarcated by vertical dashed lines. 

In [None]:
# Your code here

How about the 95th percentile? Plot that below.

In [None]:
# Your code here

We see that the distribution has some bimodality. Try to plot the same distribution, but for all confirmed Kepler exoplanets. You should be thinking about the condition we used in the "If Statements and other Conditionals" section above. We ran a **for** loop to iterate through the `kepler_df` rows, and checked **IF** the 'koi_disposition' of the row was 'CONFIRMED', i.e., if the potential exoplanet was confirmed. Instead of looping through each row, we can directly get all the confirmed exoplanet entries as follows:

In [None]:
kepler_df_conf = kepler_df[kepler_df['koi_disposition'] == 'CONFIRMED']

The above is a way to add a conditional to a pandas DataFrame. It also works with numpy (nd)arrays. Note that this is more efficient than the for loop we used above. Try to think why!
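To see what is happening under the hood, the sketch below builds the True/False "mask" as a separate step on a tiny invented table, and then does the same kind of selection on a plain Numpy array. All names and values here are made up for illustration.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "koi_disposition": ["CONFIRMED", "CANDIDATE", "CONFIRMED"],
    "koi_period": [3.5, 120.0, 41.2],
})

# The comparison is evaluated for every row at once, giving True/False values
mask = toy_df["koi_disposition"] == "CONFIRMED"
print(mask)

# Indexing with the mask keeps only the rows where it is True
toy_conf = toy_df[mask]
print(toy_conf)

# The same idea works on a Numpy array
toy_periods = np.array([3.5, 120.0, 41.2])
short = toy_periods[toy_periods < 50]
print(short)
```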

Let's plot the histogram of the periods of `kepler_df_conf`:

In [None]:
plt.hist(kepler_df_conf['koi_period'], bins=9, range=[0, 650], color='r', alpha=0.5)
plt.yscale('log')
plt.xlabel('Period [days]')
plt.ylabel('Number of Planets')
plt.title('Distribution of Kepler Planet Periods')
plt.show()

The bimodality is gone!

***

#### Q - Why do you think the bimodality disappeared in the last step?

***

Ans.

Let's get a little bit fancier. We want to write a function that will compare the number of exoplanets of any number of stars, given a list of Kepler IDs. It should take in the IDs as integers, and then create a horizontal bar plot showing their respective number of exoplanets, with the IDs on the y axis. Why horizontal? Think about it: the maximum number of exoplanets of one Kepler target (star) is 7 (we calculated this above!), and the width of our computer screen is fixed, while the number of IDs we enter is variable, and our computer can scroll to accommodate any reasonable height. This way, our labels won't get squished trying to fit everything in. 

**Step One** A function that can take in different numbers of arguments. 
Take a look at the following:

In [None]:
def a_function(arg1, arg2):
    computation = arg1 + arg2
    return computation

We can run the above and feed it two numbers:

In [None]:
a_function(1, 5)

And you can see it did the computation and returned it. But what if we want to add three numbers?

In [None]:
a_function(1, 5, 6)

What we get here is a "TypeError." It's raised because our function was specified to take exactly 2 arguments (arg1 and arg2), but we gave it three. Shoutout to Python's error message for actually being helpful. OK, so how do we fix this? 

Here's one way:

In [None]:
def new_func(array_like):
    out_sum = 0
    for i in array_like:
        out_sum += i
    return out_sum

What I've done is force the user to enter a list of numbers, then iterated through and added them all up. (Yes, we could've just run np.sum() on the array_like, but what's the fun in that?). But that's just a workaround — sometimes, we need the function to take a truly variable number of inputs. 

That's where **args** and **kwargs** come in. Check this out:

In [None]:
def sum_func(arg1, arg2, *args):
    out_sum = arg1 + arg2
    for i in args:
        out_sum += i 
    return out_sum

What's going on? Let's test the function a bit:

In [None]:
sum_func(1, 2)

In [None]:
sum_func(1, 2, 3)

In [None]:
sum_func(1, 2, 3, 4, 5)

By specifying \*args as the final input to the function, we told Python "allow any extra arguments to be entered into this function, and store them in a tuple called args." Then, we calculated the first sum (the one that is required), and went through any extra numbers that might've been entered and added them in as well. 
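A quick way to check what "args" actually holds is to print it. The function "show_args" below is a hypothetical helper written only for this check.

```python
def show_args(first, *args):
    # args collects every positional argument after the required one
    print(type(args), args)
    return args

extras = show_args(1, 2, 3)   # two extra arguments
none_given = show_args(1)     # no extras: args is simply empty
```

Python packs the extras into a tuple, which you can loop over just like a list but cannot modify in place.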

There is a slightly different version of this that applies to a "dictionary" style way of doing things. See below:

In [None]:
def dict_funct(arg1, arg2, **kwargs):
    output_dict = {}
    output_dict[arg1] = arg2
    for i in kwargs.keys():
        output_dict[i] = kwargs[i]
    return output_dict

What I've done is made a function that takes 2 things, and puts them in a dictionary where the first argument is a key and the second is a value (for illustration). Watch:

In [None]:
one = dict_funct('key1', 5)
one

In [None]:
two = dict_funct('key1', 5, key2=6)
two

In short, \*\*kwargs tells python "allow the user to add extra variables to this function, but they have to be of the form a=b, and store those extra variables in a dictionary where each a is a key and each b is a value." 

Sometimes, args and kwargs are most useful not even because you want to use the extra optional arguments in a function, but because you want your intermediary function to allow anything to get dumped into it, and just return it and pass it all along to the next function in your program. 
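Here is a minimal sketch of that pass-through pattern. Both functions are hypothetical and exist only to illustrate the idea: the outer function does not need to know which options the inner one accepts.

```python
def describe_values(values, sep=", ", prefix=""):
    # The "inner" function: joins the values into a single string
    return prefix + sep.join(str(v) for v in values)

def describe_doubled(values, **kwargs):
    # The "intermediary": does its own job, then forwards any keyword
    # arguments untouched to describe_values()
    doubled = [2 * v for v in values]
    return describe_values(doubled, **kwargs)

plain = describe_doubled([1, 2, 3])
custom = describe_doubled([1, 2, 3], sep="-", prefix="x2: ")
print(plain)
print(custom)
```

The `**kwargs` in the call unpacks the dictionary back into keyword arguments, so new options added to the inner function later become available through the outer one for free.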

OK. Back to our exoplanets. We want to compare a minimum of two Kepler IDs, with the ability to add in as many extra ones as we want. Our basic skeleton will then look something like:

In [None]:
def compare_exoplanets(id1, id2, *args):
    "some code here"
    return

Now, let's practice making the bar plot. We'll be using plt.barh(), which takes a list of y positions and a corresponding list of bar lengths, and makes a horizontal bar plot. See:

In [None]:
plt.barh([1,2,3], [4,3,6])
plt.show()

Now, we want the 1,2,3 to actually be the Kepler IDs. So we can manually set the tick labels for the plot as follows: 


In [None]:
tick_labels = (10872983, 10797460, 10854555)
plt.barh([1,2,3], [4,3,6])
plt.yticks([1,2,3], tick_labels)
plt.show()

Cool! We're basically ready to go here. Using what I've illustrated above, make a function which takes any number of IDs in our sample as integers, and makes the plot of their respective number of exoplanets. It's up to you which way you choose to look up the number of exoplanets, but the fastest way will be using the "kepler_planets" dictionary or Series we made above! Throw a title and axis labels on there while you're at it. Then test it out on the first 2 IDs, then 3.

In [None]:
# Your code here

In [None]:
# Try running your function with this block, and seeing if you get the right plots.
compare_exoplanets(10872983, 10797460)
compare_exoplanets(10872983, 10797460, 10854555)

There might be some nomenclature that's a bit unfamiliar to you in the way we designed our function, if you check our solutions, though you should have been able to accomplish what was needed using for loops and things we've learned so far. But to clue you in, we utilized two basic Python behaviors to accomplish our task in fewer lines: List addition, and list comprehension. 

List addition is simply the fact that to combine two lists into one, just add them:

In [None]:
[1,2,4] + [4,5,6]

Thus, if you have a collection of values and two separate values (as you probably did here, where args holds the extra IDs but two IDs are floating around outside), you can make a consolidated list by putting the two floating IDs into their own lists and adding them to the rest. One catch: args is a tuple, and Python will not add a list to a tuple, so convert it with list() first. In our example, 

\[id1\] + \[id2\] + list(args)

has the same effect as

\[id1, id2\] + list(args) would have.
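Because the extra arguments collected by `*args` arrive as a tuple, a runnable version of this consolidation needs a list() conversion. The function "collect_ids" below is a hypothetical helper, not part of the exercise solution.

```python
def collect_ids(id1, id2, *args):
    # [id1] + [id2] is list addition; args is a tuple, so convert it
    # to a list before adding (list + tuple raises a TypeError)
    return [id1] + [id2] + list(args)

all_ids = collect_ids(10872983, 10797460, 10854555)
print(all_ids)
print(collect_ids(10872983, 10797460))   # works with no extras too
```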

The other thing we did was a list comprehension. Watch the following:

In [None]:
empty_list = []
for i in range(10):
    empty_list.append(i*2)
empty_list

We used a for-loop to fill an empty list with the values of range(10), each multiplied by two. We can also use the following:

In [None]:
full_list = [i*2 for i in range(10)]
full_list

Essentially, we compress the for-loop iteration into 1 line. Python knows we mean "create a list with values that are i\*2 for each i in range(10)". We can do this in many situations, which saves us space in our code, and is often faster computationally as well. 
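List comprehensions can also carry a condition at the end, which filters the items as they go in. A small sketch, with a hypothetical variable name:

```python
# Keep only the even values of i, then double each one
evens_doubled = [i * 2 for i in range(10) if i % 2 == 0]
print(evens_doubled)
```

Read it as "i\*2 for each i in range(10), but only if i is even."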

**Better Function Writing** 

Let's take a few steps to make our comparison function better. The first thing we want to do is add *documentation*. This tells people how to use the function. Usually, documentation looks something like this:

In [None]:
def compare_exoplanets(id1, id2, *args):
    '''
    A function to produce a horizontal bar plot comparing the
    number of exoplanets of different Kepler targets (IDs)
    INPUTS:
        id1 (int): the first Kepler ID in the kepler_planets Series
        id2 (int): the second Kepler ID in kepler_planets
        *args (optional, int): any number of additional Kepler IDs from kepler_planets
    PRODUCES: 
        A bar plot 
    RETURNS:
        NONE
    '''
    # Code goes here (not to spoil the above exercise!)
    return

Now, if someone is looking at our code, they can easily figure out that they need, for example, to have a dictionary or series called "kepler_planets" defined in their code for this function to work. Actually, the fact that our function requires that is bad, we'll get to that in a minute. If someone were using our code but not actually looking at the text file, they could type:

In [None]:
help(compare_exoplanets)

And our documentation for it would pop up in their terminal, making it easy for them to make sure they are using it properly. 

Back to what we said about the "kepler_planets" Series. Inside our function, we index "kepler_planets" to get the number of exoplanets of an ID. But what if "kepler_planets" wasn't defined in our code? Our function wouldn't run. If we copied and pasted our function into another file, it wouldn't run by default. In short, it's not **general**. It's best to make your code as reasonably generalizable as possible — it will help you re-use your own code later, and catch bugs. We can make our function more generalizable by requiring the user to *provide* a "kepler_planets" to the function. That truly isolates it, and means we can move it from file to file or know that our tests of it aren't importing problems from elsewhere in our code. 

But what if we don't want to manually type "kepler_planets" every time we call the function, since, at least in this file, we only have one of them defined? Check this out:

In [None]:
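# The default goes after *args so that extra positional IDs are not mistaken for it
def compare_exoplanets(id1, id2, *args, kepler_planets_series=kepler_planets):
    '''
    A function to produce a horizontal bar plot comparing the
    number of exoplanets of different Kepler targets (IDs)
    INPUTS:
        id1 (int): the target Kepler ID in kepler_planets_series
        id2 (int): the target Kepler ID in kepler_planets_series
        *args (optional, int): any number of additional Kepler IDs
        kepler_planets_series (optional, keyword only): dictionary or Series of
            Kepler IDs and number of planets; defaults to kepler_planets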
    PRODUCES: 
        A bar plot 
    RETURNS:
        NONE
    '''
    # Code goes here!
    return

What I've done is add a new argument to our function (we named it "kepler_planets_series" to avoid confusion with "kepler_planets"). It is optional: in the function definition itself, we set the default value of "kepler_planets_series" to the "kepler_planets" we have sitting in our code. **Note:** pre-set or "default" arguments in functions must be defined *after* all the required, non-default ones (we couldn't put `kepler_planets_series=kepler_planets` before id1 and id2). 

This is a reasonable compromise for our code — we don't have to type `compare_exoplanets(id1, id2, kepler_planets, other_ids)` every time — we can use our function as normal. BUT, if we move our function to another code file, it's clear that we need to manually enter a new dictionary/series, or set one named "kepler_planets" outside our function in our code for it to work. 
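
One subtlety when mixing a default argument with `*args` is where the default sits in the signature. The sketch below uses two hypothetical toy functions and an invented `lookup` dictionary (none of these are part of our plotting code). It shows that a default placed *before* `*args` is filled by the third positional value, while a default placed *after* `*args` can only be set by name.

```python
# Hypothetical stand-in data: {ID: number of planets}
lookup = {101: 3, 202: 1, 303: 5}


def default_before_args(id1, id2, table=lookup, *args):
    # A third positional value is bound to `table`, not collected in args
    return table, args


def default_after_args(id1, id2, *args, table=lookup):
    # `table` is keyword-only here, so every extra positional value is an ID
    ids = [id1, id2] + list(args)
    return [table[i] for i in ids]


table, extra = default_before_args(101, 202, 303)
print(table)  # 303 (the third ID replaced the default dictionary)
print(extra)  # ()

print(default_after_args(101, 202, 303))                     # [3, 1, 5]
print(default_after_args(101, 202, table={101: 9, 202: 8}))  # [9, 8]
```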

As a final edit, I'll update the documentation to include the requirements on the input dictionary/series. I'll also make the function as general as possible (no default) and move "kepler_planets_series" to the front of the required arguments. This is just for the aesthetic of giving a dictionary/series and then as many IDs as you want (minimum 2), rather than 2 IDs, a dictionary/series, and then more IDs. 

In [None]:
def compare_exoplanets(kepler_planets_series, id1, id2, *args):
    '''
    A function to produce a horizontal bar plot comparing the
    number of exoplanets of different Kepler targets (IDs)
    INPUTS:
        kepler_planets_series (dict): A dictionary containing Kepler IDs and number of
                                      planets of the form {id (int): planets (int)}
        id1 (int): the target Kepler ID in kepler_planets_series
        id2 (int): the target Kepler ID in kepler_planets_series
        *args (optional, int): any number of additional Kepler IDs from kepler_planets_series
    PRODUCES: 
        A bar plot 
    RETURNS: 
        NONE
    '''
    # Code here
    return

***

#### Exercise: Do it yourself!

***

Here are a few exercises to play around with this function and make it EVEN MORE general, which you should be able to do with some quick googling. 

1. What if someone enters an ID in your function that isn't in the dictionary (not in Kepler, or misspelled)? As of now, your function will stop and throw a "key error" saying that the ID is not in the dictionary. For the sake of exercise, let's change that behavior: ignore any ID that isn't included, move on to the other IDs, and still produce the plot. Update your function such that if an ID isn't in the dictionary, it prints a warning "Warning, ___ wasn't in the kepler_planets dict, continuing..." so the user knows, but then still plots the rest of the (working) IDs. You could do this with an if-statement before actually querying the dict, or if you're adventurous, look up "try and except statements" online. 
2. Look around plt.barh's documentation, and see if you can plot the ID with the highest number of planets in a different color than the rest. Note: the easiest way might be to plot all the bars in one color first, and then plot the top bar again in the new color on top of it.
3. Try to create a new dictionary (or series) kepler_planet_periods of the form {'kepler_name' (str): koi_period (float)}. We want you to get comfortable with using strings as dictionary keys. Then, write a function compare_periods where you compare the periods of different Kepler planet names. Note that the string matching from function argument to dictionary key is exact: the user can't enter 'malena' if the key was 'Malena'. The best way around this might be to coerce all the strings in the dictionary to be all lower or upper case, and then coerce the user input to the function to the same case before attempting to query the dictionary. Look up how to make strings upper or lower case, and implement that in your function. 
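
If you get stuck, here is a rough sketch of the two lookup patterns mentioned above, shown on a toy dictionary rather than on the exercise itself. The `scores` dictionary and the function names `lookup_with_if` and `lookup_with_try` are invented for illustration, and this is only one possible way to structure it.

```python
# Hypothetical dictionary whose keys are already lower case
scores = {'alpha': 1, 'beta': 2}


def lookup_with_if(keys):
    # Check membership before querying the dictionary
    found = []
    for key in keys:
        key = key.lower()  # match the case used for the dictionary keys
        if key in scores:
            found.append(scores[key])
        else:
            print(f"Warning, {key} wasn't in the dict, continuing...")
    return found


def lookup_with_try(keys):
    # Query first, and handle the KeyError if the key is missing
    found = []
    for key in keys:
        try:
            found.append(scores[key.lower()])
        except KeyError:
            print(f"Warning, {key} wasn't in the dict, continuing...")
    return found


print(lookup_with_if(['Alpha', 'gamma', 'BETA']))   # [1, 2]
print(lookup_with_try(['Alpha', 'gamma', 'BETA']))  # [1, 2]
```

Both versions skip the missing key and keep going. Which one reads better is largely a matter of taste.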


Alright! That's it for this tutorial. As always, we hope it was helpful to you. If you have any questions about it (or find typos), or a question about your own code as you're getting started, feel free to email me!

For more information on functions, refer to the new [Functional Programming](https://prappleizer.github.io/Tutorials/FunctionalProgramming/FunctionalProgramming_web.html) chapter of the [textbook](https://prappleizer.github.io/index.html) we are using. This chapter is also available as a pdf in the "pasea2022/PythonML_workshop/Additional_Resources" folder.