# Lecture 5

**Authors:**
* Yilber Fabian Bautista 

**Last date of modification:**
 December 23rd, 2021

Hello there,

welcome to Lecture 5 of this mini-lecture series on programming with Python. In this series, you will learn basic and intermediate Python tools that will be of great use in your scientific career.

By the end of this lecture you will be able to:
* Use **dictionaries** in Python
* Use the **pandas** library to create, modify, save and read **csv** data files
* Plot using the **pandas** library

## Dictionaries

In **Lecture 2** we learned how to use **lists** and **arrays** as data structures. In this lecture we introduce **dictionaries**.

**What are Dictionaries?**

A dictionary consists of keys and values assigned to those keys. Since **dictionaries** have similar characteristics to **lists**, it is helpful to compare the two.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%202/images/DictsList.png" width="650" /> 

Figure taken from [IBM courses](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%202/images/DictsList.png)

Instead of numerical indexes, dictionaries have keys, which are immutable objects like **strings** or **tuples**. These keys are the "indexes" that are used to access values within a dictionary. 
Like **lists**, **dictionaries** are mutable objects, which means we can modify them: change the value for a given key, add new keys with their corresponding values, etc. 

The python syntax to create a dictionary is the following:

```py
my_dictionary = {'key':value}
```

The `type` of a dictionary (`type(my_dictionary)`) is `dict`. 

Let us create our first **dictionary** in python:


In [None]:
import numpy as np
# Create the dictionary

my_dictionary = {"key1": 1, "key2": "2", "key3": np.array([3, 3, 8]),
                 "key4": (4, 4, '4'), ('key5'): 5, (0, 1): 6}
type(my_dictionary)

Notice that, as for **lists** and **tuples**, dictionary elements do not have to be of the same `type`.

### **Accessing the keys of a dictionary**:

In case we do not know the keys of a given dictionary, the `keys()` method can be used to obtain them. It returns a `dict_keys` view object listing all the keys of the dictionary.
```py
my_dictionary.keys()
```
### **Accessing  the value by the key:**

To access the value stored under a given key, we have two options. The first one uses an "indexing"-like approach, that is

```py
my_dictionary['key1']
```
this will return the output `1`. The second option is by using the  `get()` method. It is used as

```py
my_dictionary.get((0, 1))
```
this will produce the output `6`. 

**Why is get useful?** If we ask for the value of a key that does not exist in the dictionary, indexing with square brackets raises a `KeyError`. The `get` method avoids this error and simply returns `None`. For instance, if we did

```py
my_dictionary.get('key7')
```
the output will be `None`.
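
The difference between the two access styles can be sketched as follows. This is a minimal illustration on a small stand-in dictionary (`example` is a hypothetical name, not the `my_dictionary` defined above). It also shows the optional second argument of `get`, which sets the value returned for a missing key.

```python
# Small stand-in dictionary (hypothetical, for illustration only)
example = {"key1": 1, (0, 1): 6}

# Square-bracket indexing raises KeyError for a missing key
try:
    example["key7"]
    raised = False
except KeyError:
    raised = True

# get() returns None instead of raising
value_none = example.get("key7")

# get() accepts an optional default that is returned for a missing key
value_default = example.get("key7", 0)

print(raised, value_none, value_default)
```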

### **Adding a new element to the dictionary**

For **lists** and **arrays** we would have used the `list.append(value)` method and the `np.append(array,value)` function respectively. For **dictionaries** we instead use the following syntax:

```py
my_dictionary['key7'] = 'new_value'
```

### **Remove a key from a dictionary**

To remove a key with its respective value we use the `pop()` method. For instance, if we want to delete the newly assigned `key7` from our dictionary we use:
```py
my_dictionary.pop('key7')
```
then, if we ask for the keys of our dictionary 
```py
my_dictionary.keys()
```
we will get as output `dict_keys(['key1', 'key2', 'key3', 'key4', 'key5', (0, 1)])`.
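
As a small sketch of two details of `pop()` and `keys()` that are easy to miss: `pop()` returns the value it removes, and `keys()` returns a view that can be turned into a plain list with `list()`. The dictionary `example` below is a hypothetical stand-in used only for illustration.

```python
# Stand-in dictionary (hypothetical, for illustration only)
example = {"key1": 1, "key7": "new_value"}

# pop() removes the key and returns the value that was stored under it
removed = example.pop("key7")

# A second argument is returned instead of raising KeyError when the key is absent
missing = example.pop("key7", None)

# keys() returns a view object; wrap it in list() to get a plain list
remaining_keys = list(example.keys())

print(removed, missing, remaining_keys)
```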

See [this link](https://www.w3schools.com/python/python_ref_dictionary.asp) for additional dictionary methods.


# Exercise 1
1. Try out the previous explanation yourself (typing all of the commands yourself gives you some coding practice)
2. Given the following dictionaries, write short lines of code to answer the following questions (see [this page](https://holypython.com/beginner-python-exercises/exercise-8-python-dictionaries/) for more exercises with dictionaries): 

```py
dictionary_1 ={"name": "Plato", "country": "Ancient Greece", "born": -427, "teacher": "Socrates", "student": "Aristotle"}
dictionary_2 = {"son's name": "Lucas", "son's eyes": "green", "son's height": 32, "son's weight": 25}
```
* When was Plato born?
* Change Plato's birth year from 427 BC to 428 BC
* Add the key "work" to `dictionary_1`, with the values "Apology", "Phaedo", "Republic", "Symposium" in a list 
* Add 2 inches to the son's height in `dictionary_2`
* Using the `.get()` method, print the value of "son's eyes"
* Merge `dictionary_1` and `dictionary_2` into the dictionary `dictionary_merge`, using the syntax:

```py
dictionary_merge = {**dictionary_1, **dictionary_2}
```
and print the list of keys in `dictionary_merge`
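
As a hedged illustration of merging by dictionary unpacking, the sketch below uses a different, hypothetical pair of dictionaries so that the exercise stays open. When a key appears in both inputs, the value from the right-most dictionary wins.

```python
# Two small hypothetical dictionaries sharing the key "b"
first = {"a": 1, "b": 2}
second = {"b": 20, "c": 30}

# Unpack both into a new dictionary literal; the inputs are left unchanged
merged = {**first, **second}

print(merged)
print(list(merged.keys()))
```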


## The pandas library

We now focus on the main part of this tutorial, i.e. the **pandas** library.
It is used to efficiently manipulate tabular data, such as data stored in spreadsheets or databases, and to do statistical analysis of such data. The pandas library supports several file formats and data sources, for instance `csv`, `excel`, `sql`, `json`, etc. In this tutorial we will focus on `csv` files. For a detailed view of the **pandas** library, see [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started), and for Library Highlights see [here](https://pandas.pydata.org/about/index.html).


In **pandas**, a data table is called a `DataFrame`. 

Let us start our pandas journey by creating simple data frames from the data structure we learned above, namely **dictionaries**. We need to import the `pandas` library first.

In [None]:
import pandas as pd

Given the dictionary: 

In [None]:
dictionary =  {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 25],
        "Course": ["Data Structures", "Machine Learning", "OOPS with java"],
    }

a `DataFrame`  is created as follows

In [None]:
df = pd.DataFrame(dictionary)
df

In [None]:
df[1:3]

# Exercise 2

Get some experience with `DataFrames` by coding all the explanations below (avoid copying and pasting the code from the text).

### **The head/tail methods**

To access the first/last `i` rows of your `DataFrame` we use the `head(i)/tail(i)` methods. This is useful when dealing with large datasets. As an example, we can access the first two rows of our `df` defined above:

```py
df.head(2)

```

### **Slicing a DataFrame**

To access a desired slice of a `DataFrame` we use a syntax similar to that for `lists`. We can of course save that slice as a new `DataFrame`, which has the same columns as the original `DataFrame` but fewer rows. Let us for instance take the slice containing the second and third rows of our `df` defined above and save them in the new `DataFrame` `sub_df`. We use

```py
sub_df = df[1:3]
sub_df
```
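
A short sketch of what such a slice contains, using a hypothetical three-row table `demo` in place of `df`. It shows that the slice is taken by row position with the end point excluded, that it matches the explicit position-based form `iloc[1:3]`, and that the original row labels are kept.

```python
import pandas as pd

# Hypothetical three-row table, standing in for `df`
demo = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22, 35, 25]})

# Rows at positions 1 and 2; the end point 3 is excluded, as for lists
sub_demo = demo[1:3]

# The same slice written explicitly with the position-based indexer
sub_demo_iloc = demo.iloc[1:3]

# The slice keeps the original row labels (1 and 2); it does not restart at 0
print(list(sub_demo.index))
print(sub_demo.equals(sub_demo_iloc))
```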


When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the `DataFrame`. 

### **Keys in a DataFrame**

As for dictionaries, we can ask for the `keys` of a `DataFrame` using the `keys()` method. In our `df` example 
```py
df.keys()
```

we get as output `Index(['Name', 'Age', 'Course'], dtype='object')`.

One can also access the keys using the attribute `columns`. 

```py
df.columns

```

### **Accessing a column in a DataFrame**

Once we know the `keys` in our `DataFrame`, we can access different parts of it. To access a specific column of our `DataFrame` we use the same syntax as for dictionaries. For instance, to access the column containing all of the names in our `df` defined above we use:

```py
df['Name']

```
which will produce as output a pandas `Series` (a single labelled column) with all of the names contained in the table. 


### **Accessing a row in a DataFrame**
To access a row in a data frame we use the `iloc[]` indexer, which selects rows by integer position. Imagine we want to access the second row in our `df` and store it in a new variable. We use

```py
Allen_info = df.iloc[1]
print(Allen_info)
```
which will produce the output

```py
Name      Allen, Mr. William Henry
Age                             35
Course            Machine Learning
Name: 1, dtype: object
```


### **Accessing a single element in a DataFrame**

Now imagine we want to know `Allen's` age. For that we have different options, either by accessing it using the variable `Allen_info` we just defined, or by accessing it directly in the original `df`. For the former we will use

```py
Allen_info['Age']
```
whereas for the latter we use
```py
df['Age'][1]
```
with both methods producing the output `35`. 
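
For comparison, here is a minimal sketch of other ways to reach a single entry, on a hypothetical table `demo` standing in for `df`. The single-step forms `loc`, `at` and `iat` are generally the safer choice when assigning values, since chained indexing such as `demo["Age"][1]` may act on a temporary copy.

```python
import pandas as pd

# Hypothetical table, standing in for `df`
demo = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [22, 35, 25]})

# Column first, then row label (the chained form used in the text)
age_chained = demo["Age"][1]

# Label-based access in a single step: row label, then column name
age_loc = demo.loc[1, "Age"]

# Fast scalar access with the same [row, column] order
age_at = demo.at[1, "Age"]

# Position-based access: row position 1, column position 1
age_iat = demo.iat[1, 1]

print(age_chained, age_loc, age_at, age_iat)
```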



In [None]:
# Try this out yourself in this cell




### **Adding a new row to the DataFrame**

For lists we would use `append()`. A `DataFrame` used to offer an `append()` method as well, but it was deprecated in pandas 1.4 and removed in pandas 2.0, so to add a new row we use `pd.concat()` instead. Since the `DataFrame` has several columns, we first need to create a dictionary with the new entries, wrap it in a one-row `DataFrame`, and concatenate it with the original one. Let us see how this works with the following example


In [None]:
sara_info = {'Name':'Sara Remmen', 'Age': 20, 'Course':'Python programming','Grade': 10}
# wrap the dictionary in a one-row DataFrame and concatenate it with df
df2 = pd.concat([df, pd.DataFrame([sara_info])], ignore_index = True)
df2

Notice we have added a new `key` to the dictionary. This has created a new column with the value `NaN` for the other rows. Furthermore, note that `pd.concat()` does not modify the original `DataFrame`; we have instead created a new `df2` to store those changes. This is useful if you do not want to modify the original `DataFrame`. If we want to modify `df` directly, we can use the `loc[i]` indexer, where `i` is the row label under which we want to add the new row. Let us do it explicitly in our example


In [None]:
df.loc[3] = ['Nathan S.',20, 'Julia programming']
df

### **Changing an entry in your DataFrame**

If we want, for instance, to change Allen's grade in our `df2` defined above, we use the `at[]` accessor. Let us see it in practice

In [None]:
df2.at[1,'Grade'] = 8 
df2

As we see, Allen's Grade changed from `NaN` to `8`. Notice the syntax for the `at[.,.]` function is `at[row,column]`. 

### **Removing a whole row in a DataFrame**

In a similar way to adding rows, we can delete them, either saving the output in a new `DataFrame` or modifying the original `DataFrame`. For instance, let us remove the entry for Sara's info from `df2`. Notice the label for Sara's row is `3`. In the former case we use the syntax 
```py
df3 = df2.drop(labels = 3 )
df3
```
where `3` corresponds to Sara's row label. 

For the latter case (modifying the original `DataFrame`) we use

```py
df2.drop(labels = 3 ,inplace=True)
```
However, we have to be careful when using this option, since once rows are removed we cannot recover them. Therefore, unless we are sure we want to delete a row, it is safer to use the `drop()` method without the `inplace=True` keyword. 
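
The two behaviours can be compared in a minimal sketch on a hypothetical four-row table `demo` (not the `df2` above). Keeping an explicit copy with `copy()` before an in-place operation is one simple safety net; the name `backup` is only illustrative.

```python
import pandas as pd

# Hypothetical four-row table, for illustration only
demo = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Age": [22, 35, 25, 20]})

# Without inplace=True, drop() returns a new DataFrame and leaves `demo` untouched
trimmed = demo.drop(labels=3)

# An explicit copy is a simple safety net before an in-place operation
backup = demo.copy()
demo.drop(labels=3, inplace=True)

print(len(trimmed), len(demo), len(backup))
```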

In [None]:
# Some space to write your own code


### **Adding a whole column to the DataFrame**

To add a complete column we use a syntax similar to that of dictionaries. We have to specify all the elements of the column in a list, whose length has to match the number of rows of the `DataFrame`. For our `df2` with its four rows (that is, before Sara's row is dropped), we would write:

```py
df2['Major'] =  ['Physics and Astronomy', 'Computer Science','Physics and Astronomy','Maths']
df2

```



In [None]:
#some space for coding

### Merging two (or more) DataFrames
One of the main advantages of **pandas** is that it offers high-performance merging and joining of data sets. Here we use `pd.concat` to stack two `DataFrames` with the same columns on top of each other. 
Let us see it in a specific example

In [None]:
data1 = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
data2 = [{'a': 8, 'b': 12},{'a': 23, 'b': 2, 'c': 13}]

ndf1 = pd.DataFrame(data1,
    index=[0, 1])
ndf2 = pd.DataFrame(data2,
    index=[2, 3])

frames = [ ndf1,ndf2]
result = pd.concat(frames)

result

The `index` keyword allows us to specify the row label for each entry of the `DataFrame`.
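
A small sketch of how `pd.concat` treats row labels and missing columns, using two hypothetical one-row tables `left` and `right`. By default the row labels are kept as they are, so duplicates can appear. The `ignore_index=True` option renumbers the rows instead.

```python
import pandas as pd

# Two hypothetical one-row tables that both use the row label 0
left = pd.DataFrame([{"a": 1, "b": 2}], index=[0])
right = pd.DataFrame([{"a": 5, "b": 10, "c": 20}], index=[0])

# Row labels are kept as they are, so the label 0 appears twice
stacked = pd.concat([left, right])

# ignore_index=True discards the old labels and renumbers the rows 0, 1, ...
renumbered = pd.concat([left, right], ignore_index=True)

# Column "c" is missing in `left`, so that entry is filled with NaN
print(list(stacked.index), list(renumbered.index))
print(renumbered["c"].isna().tolist())
```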

### Plotting with pandas
The pandas library also offers a convenient plotting method, `plot()`. Let us see a simple plotting example using `result`. For more on plotting see the [Chart Visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) guide.

In [None]:
result.plot("a","b",kind = 'scatter')

## **Saving Data frames**

Having gained some experience manipulating `DataFrames`, we now learn how to save the data. In this tutorial we show how to save the `DataFrame` in a `.csv` file. `csv` stands for Comma Separated Values; it is a plain text file that contains a list of data, separated by commas. 

The syntax for saving a `DataFrame` as `.csv` on macOS and Linux is:

```py
df.to_csv('/...path_to_file.../name_file.csv')
```
whereas on Windows the path separators are written as double backslashes `\\`.

Let us save our `df2` frame defined above. If no path is specified, the file will be saved in the same folder as our Jupyter notebook.

In [None]:
df2.to_csv('saving_example.csv')

## **Loading a csv file using pandas**

We can also load existing data files using the **pandas** library. The syntax for this is the following:

```py
new_df = pd.read_csv("/...path_to_file.../file_name.csv")

```
This will create a new `DataFrame` named `new_df`. We can then manipulate our `new_df` as explained above. 

In [None]:
new_df = pd.read_csv('saving_example.csv')

For instance we can  add a new column to our `new_df`

In [None]:
new_df['Nationality']=['Canada','Germany','England','India']
new_df

Finally, to save the changes we made to our `csv` file we just use the saving tool described above. If we use the same file name as the original, the file will be overwritten with the new changes.  

```py
new_df.to_csv('saving_example.csv')

```
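
One detail worth knowing about this save/load round trip: by default `to_csv` also writes the row labels, which `read_csv` then loads as an extra column named `Unnamed: 0`. The sketch below, on a hypothetical table `demo` and a temporary file, shows two common ways to avoid that. It also uses `pathlib` to build the file path, which picks the right separator on any operating system; the file name is only illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical table, for illustration only
demo = pd.DataFrame({"Name": ["A", "B"], "Age": [22, 35]})

with tempfile.TemporaryDirectory() as tmp:
    # pathlib builds the path with the right separator on any operating system
    csv_path = Path(tmp) / "roundtrip_example.csv"

    # Default behaviour: the row labels are written as an extra, unnamed first column
    demo.to_csv(csv_path)
    with_index = pd.read_csv(csv_path)

    # index=False leaves the row labels out of the file
    demo.to_csv(csv_path, index=False)
    without_index = pd.read_csv(csv_path)

    # Alternatively, tell read_csv that the first column holds the row labels
    demo.to_csv(csv_path)
    labels_restored = pd.read_csv(csv_path, index_col=0)

print(list(with_index.columns))
print(list(without_index.columns))
```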

# Exercise 3
### Galaxy rotation curve

The directory `Rotation curves` contains data sets for the rotation curves of several dwarf galaxies. See [rotation curves](https://github.com/keiwanjamaly/Python-Tutorials/blob/main/Rotation%20curves/readme.pdf) for a detailed description of the data files. In this exercise we will practice our **pandas** skills on those files (they will also be used in Tutorial 6 using **classes**, and in Tutorial 8, when covering **function fitting**).

By the end of this exercise we will obtain our Dark Matter rotation curve $$v_{DM}(r) = \sqrt{v_c^2(r)-v_{star}^2(r)-v_{gas}^2(r)},$$ as well as our DM mass distribution $$M_{DM}(r) = r v^2_{DM}(r)/G . $$ We will plot them taking into account errors in the measurements as well as systematic errors. Let us divide the exercise into several steps.

1. Choose your favorite galaxy, i.e. identify the two files `RotationCurve galaxy.csv` and `RotationCurve baryons galaxy.csv`.
2. Create the **DataFrames** `df_circular` and `df_baryon` for each of the two data files. They should look similar to the following example for Galaxy `'IC2574'`:

![](./Figures/circ_velocity.png "Circular velocity")


![](./Figures/cir_vel_baryon.png "Baryonic circular velocity")


3. The radial data points used in the two `DataFrames` are different and therefore we want to have the values of `gas circ velocity` and `star circ velocity` evaluated at the same radial positions as `circ velocity`. For that we need to create  **interpolating** functions for the `gas circ velocity` and `star circ velocity` data (We will cover more on interpolating functions in Tutorial 7).  Use 
```py
from scipy.interpolate import interp1d 
```
and then create the interpolating functions with
```py
stars_circ_velocity = interp1d(r_baryon,v_star,kind = 'quadratic')
gas_circ_velocity = interp1d(r_baryon,v_gas,kind = 'quadratic')
```
where `r_baryon` is the list containing the radial points in `df_baryon` created in step 2, and `v_star` and `v_gas` are the corresponding star and gas circular velocities. 
4. Now that we have our two interpolating functions, we can evaluate them at `r_circular`, the radial points given in `df_circular`. Add two new columns to `df_circular` containing the stars and gas circular velocities evaluated at  `r_circular`. At this stage our `df_circular` should look like

![](./Figures/step_4.png)

5. As explained in [rotation curves](https://github.com/keiwanjamaly/Python-Tutorials/blob/main/Rotation%20curves/readme.pdf), some of the measured errors (i.e. some values in `circ velocity error`) have been underestimated. In this step we will include a systematic error at the level of 5% of the last measured velocity point (i.e. the last data point of column `circ velocity` in our `df_circular`). Add a new column in `df_circular` called `syst error` whose values contain the systematic error. Then, add a new column `total error`, containing the total error computed as the sum in quadrature of `syst error` and `circ velocity error`. After this step, our `df_circular` should look like

![](./Figures/step_5.png)

6. Let us now add a column with the DM velocity, which is computed from $$v_{DM}(r) = \sqrt{v_c^2(r)-v_{star}^2(r)-v_{gas}^2(r)}.$$ We have to be careful, however, since $v_{DM}^2(r)$ could be negative; if that is the case, we choose to change the sign of $v_{gas}^2(r)$. Similarly, compute $M_{DM}(r)$ and add it as a new column (hint: Newton's constant is `G_N = 4.302e-6` in units of kpc km^2 s^-2 Msol^-1). After this step, our `df_circular` should look like:

![](./Figures/step_6.png)

7. Plot $v_{DM}(r)$ including the error bars given by the `total error` column (hint: use the `errorbar` plotting function in `matplotlib`).
8. Plot $M_{DM}(r)$. 
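
The arithmetic of steps 5 and 6 can be sketched on plain NumPy arrays before applying it to the `DataFrame` columns. The numbers below are synthetic and chosen only for illustration, all variable names are hypothetical, and the sign flip implements one possible reading of the rule stated in step 6.

```python
import numpy as np

# Synthetic numbers, for illustration only (not taken from the data files)
r = np.array([1.0, 2.0, 3.0])            # radius in kpc
v_c = np.array([30.0, 40.0, 50.0])       # circular velocity in km/s
v_c_err = np.array([0.0, 6.0, 0.0])      # measured error in km/s
v_star = np.array([10.0, 10.0, 10.0])    # stellar contribution in km/s
v_gas = np.array([5.0, 45.0, 20.0])      # gas contribution in km/s

# Step 5: systematic error = 5% of the last measured velocity, same for every row
syst_error = 0.05 * v_c[-1] * np.ones_like(v_c)
total_error = np.sqrt(v_c_err**2 + syst_error**2)

# Step 6: v_DM^2, flipping the sign of the gas term where the result is negative
v_dm_sq = v_c**2 - v_star**2 - v_gas**2
v_dm_sq = np.where(v_dm_sq < 0, v_c**2 - v_star**2 + v_gas**2, v_dm_sq)
v_dm = np.sqrt(v_dm_sq)

# Enclosed dark-matter mass in solar masses, with G in kpc km^2 / (s^2 Msol)
G_N = 4.302e-6
m_dm = r * v_dm_sq / G_N

print(total_error)
print(v_dm)
```

With real data the same expressions can be applied directly to the columns of `df_circular`, since pandas columns support element-wise arithmetic in the same way as NumPy arrays.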