# <div align='center'>Introduction to Python for data analysis: NumPy and pandas libraries</div>

NumPy and pandas are two fundamental libraries for data analysis. They are used to efficiently clean, manipulate and prepare data for analysis and visualization.

# Contents

1. <a href="#numpy">NumPy</a>
    1. <a href='#ndarray'>Introduction to ndarray</a>
    2. <a href='#create_array'>How to create a ndarray?</a>
    3. <a href='#print_array'>Printing arrays</a>
    4. <a href='#operations'>Arithmetic operations</a>
    5. <a href='#indexing'>Indexing, slicing and iterating</a>
    6. <a href='#shaping'>Changing the shape of an array</a>
    7. <a href='#concatenating'>Concatenating, joining and stacking arrays</a>
    8. <a href='#adding'>Adding, sorting and removing elements</a>
    9. <a href='#random'>Random number generation</a>
2. <a href="#pandas">Pandas</a>
    1. <a href='#series'>Series</a>
    2. <a href='#dataframes'>DataFrames</a>
        1. <a href='#csv'>Create a DataFrame from a CSV file</a>
        2. <a href='#dict'>Create a DataFrame from a dictionary</a>
        3. <a href='#array'>Create a DataFrame from a NumPy array</a>
        4. <a href="#summarize">Summarize data</a>
        5. <a href="#combine">Combining and merging DataFrames</a>
        6. <a href='#write'>Save DataFrames to a file and convert to NumPy array</a>
        7. <a href="#clean">Handling missing data</a>
        8. <a href="#transform">Transform and replace data</a>
        9. <a href="#remove">Remove unwanted rows/columns</a>
        10. <a href="#query">Examine the data and make selections and queries</a>
        11. <a href="#add">Add new columns/rows</a>
        12. <a href="#plotting">Plotting</a>
        13. <a href="#other">Other helpful attributes and functions</a>

## <div id="numpy">1. NumPy</div>

[NumPy](https://numpy.org/) is a powerful library that offers comprehensive mathematical functions, random number generators, linear algebra routines and much more. Its main data object is a multidimensional array, known as ```ndarray```, which is significantly faster than the Python built-in list data type for numerical work. A Python list can contain elements of different data types, while all elements of a NumPy array must have the same data type.

**Why use NumPy over Python lists?**

NumPy arrays are faster and consume less memory than Python lists, because their elements share a single data type and are stored in one contiguous block of memory.

Let's import NumPy and introduce its most common and useful features.

In [None]:
import numpy as np

If the above didn't work, please run the following on a terminal/console:
```
pip install numpy
```
or
```
pip3 install numpy
```

### <div id="ndarray">1.A Introduction to ndarray</div>

A ```ndarray``` could be considered as a table containing elements of the same type, indexed by non-negative integers.
These arrays are multidimensional and the dimensions of an ```ndarray``` object are called axes. A ```ndarray``` can be created with the ```np.array()``` function (not to be confused with the ```array.array``` class from the Python standard library). Let's take a look at a couple of examples:

<u>Example of a ndarray with a single axis:</u>

In [None]:
a1d = np.array([1, 2, 0])
print(a1d)

The above array has one axis and three elements (i.e. its length is 3).

<u>Example of a ndarray with two axes:</u>

In [None]:
a2d = np.array([[1, 2, 0], [2, 0, 3]])
print(a2d)

The ```a2d``` array has 2 axes, the first one has a length of 2 and the second one a length of 3.

<u>Let's take a look at some useful attributes:</u>

1. ```ndarray.ndim```:

    Returns the number of axes (i.e. dimensions) of an array. Let's confirm it:

In [None]:
print(f'{a1d.ndim = }')
print(f'{a2d.ndim = }')

2. ```ndarray.shape```:

    Returns a tuple containing the length of each dimension. For a matrix with ```n``` rows and ```m``` columns, ```shape``` will be ```(n,m)```. The length of the tuple is hence the number of axes (i.e. it agrees with ```ndim```). Let's confirm it:


In [None]:
print(f'{a1d.shape = }')
print(f'{a2d.shape = }')

3. ```ndarray.size```:

    Returns the total number of elements in the array. This is equal to the product of all the elements in the shape tuple. Let's confirm it:


In [None]:
print(f'{a1d.size = }')
print(f'{a2d.size = }')

### <div id="create_array">1.B How to create a ndarray?</div>

There are a few different ways to create a NumPy array. The main function to create one is the ```array()``` function to which we can provide a list or a tuple. We saw a couple of examples already using a list. Let's see a couple of examples using a tuple:

In [None]:
my_array_1 = np.array((2, 3, 1))
print(my_array_1)

In [None]:
my_array_2 = np.array(((2, 3, 1), (0, 1, 2)))
print(my_array_2)


**Note:** The type of the array is deduced from the type of the provided elements.

Sometimes, we know the shape and the size of the array that we need, but we don't necessarily have the elements yet. For such cases, we have a few different functions to create arrays with placeholder elements (this is recommended over increasing the size of an array which is an expensive operation). We will see only two of them:

1. ```zeros(shape)```:

It creates an array full of zeros using the provided shape. Example:

In [None]:
nzeros = np.zeros((2, 3))
print(nzeros)

2. ```ones(shape)```:

Similarly to the ```zeros()``` function, ```ones()``` creates an array full of ones using the provided shape. Example:

In [None]:
nones = np.ones((2, 3))
print(nones)

**Note:** ```zeros()``` and ```ones()``` create arrays of type ```float64``` by default, whereas ```array()``` deduces the type from the provided elements. In all three functions, the type can be set explicitly via the keyword argument ```dtype```.
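
As a minimal sketch of how ```dtype``` behaves, the following compares a deduced type with explicitly requested ones. The variable names (```deduced```, ```int_zeros```, ```bool_ones```) are chosen here only for illustration:

```python
import numpy as np

# dtype deduced from the elements (a single float makes the whole array float64)
deduced = np.array([1.0, 2, 3])

# dtype requested explicitly through the keyword argument
int_zeros = np.zeros((2, 3), dtype=np.int32)
bool_ones = np.ones(4, dtype=bool)

print(f'{deduced.dtype = }')
print(f'{np.zeros(3).dtype = }')   # default type of zeros()
print(f'{int_zeros.dtype = }')
print(f'{bool_ones = }')
```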

If we wish to create a sequence of numbers, we can use the ```arange()``` function, which is analogous to the Python built-in ```range()``` function except that it returns a ndarray. Example:

In [None]:
my_sequence = np.arange(1, 30)
print(my_sequence)

If we wish to create a sequence of float numbers, we can use the ```linspace()``` function. For example, the following will create an array of 11 evenly spaced numbers from 0 to 1 (both included):

In [None]:
my_sequence_float = np.linspace(0, 1, 11)
print(my_sequence_float)

**Note:** We could have used the ```arange()``` function with a float number as the third argument (i.e. the step size), but then the number of elements obtained is sometimes hard to predict, due to the finite floating point precision.
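
To illustrate the point, here is a small sketch comparing both functions on the same interval. The interval and step are arbitrary choices; depending on rounding, the ```arange()``` result may contain one element more than one would naively expect, whereas ```linspace()``` always returns exactly the requested number of elements:

```python
import numpy as np

# Float step: the number of elements depends on how (stop - start) / step is rounded
with_arange = np.arange(1, 1.3, 0.1)

# linspace: the number of elements is requested explicitly (4 here)
with_linspace = np.linspace(1, 1.3, 4)

print(f'{len(with_arange) = }, {with_arange = }')
print(f'{len(with_linspace) = }, {with_linspace = }')
```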

### <div id="print_array">1.C Printing arrays</div>

One-dimensional arrays are printed as rows, two-dimensional arrays as matrices and three-dimensional arrays as lists of matrices. Examples:

1. 1D array:

In [None]:
a1d = np.arange(10)
print(f'{a1d = }')

2. 2D array:

In [None]:
a2d = np.array([[1, 2, 3], [2, 4, 5]])
print(f'{a2d = }')

3. 3D array:

In [None]:
a3d = np.array([[[1, 2, 3], [2, 4, 5]], [[0, 2, 3], [1, 5, 8]]])
print(f'{a3d = }')

**Note:** If an array is too large to be fully displayed, NumPy skips the central part of the array. Example:

In [None]:
print(np.arange(10000))

### <div id="operations">1.D Arithmetic operations</div>

Arithmetic operations on arrays are applied element-wise and a new array is created and filled with the result of the operation. Let's create two arrays and then do some operations:

In [None]:
a = np.array([1, 3, 5, 6, 8, 9, 13, 15, 12, 24])
b = np.arange(0, 10)
print(f'{a = }')
print(f'{b = }')

In [None]:
# subtraction
c = a - b
print(f'c = a - b = {c}')

In [None]:
# Multiplication
c = a*2
print(f'c = a*2 = {c}')

**Note:** If we wish to perform a matrix product, we need to use the ```@``` operator (available in Python>=3.5) or the ```dot()``` function. Example:

In [None]:
matrix_a = np.array([[0, 2], [1, 3]])
matrix_b = np.array([[0, 1], [1, 0]])
matrix_product = matrix_a @ matrix_b
print(f'{matrix_product = }')

In [None]:
# Exponentiation
c = a**2
print(f'c = a**2 = {c}')

In [None]:
# Check which elements in "a" are even
c = a % 2 == 0
print(f'c = (a % 2 == 0) = {c}')

**Note:** The ```+=``` and ```*=``` operations modify an existing array instead of creating a new one.
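
The difference can be seen by keeping a second name for the same array, as in the following sketch (```original``` and ```alias``` are names used only for this illustration):

```python
import numpy as np

original = np.array([1, 2, 3])
alias = original            # second name for the same array object

original += 10              # in-place: the existing array is modified
print(f'{alias = }')        # the alias sees the change

original = original + 1     # creates a new array and rebinds the name
print(f'{original = }')
print(f'{alias = }')        # the alias still refers to the old array
```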

Let's take a look at some helpful NumPy built-in mathematical methods and functions:

1. ```sum()```:

In [None]:
print(f'{a = }')
print(f'{a.sum() = }')

2. ```mean()```:

In [None]:
print(f'{a.mean() = }')

3. ```std()```:

In [None]:
print(f'{a.std() = }')

4. ```min()```:

In [None]:
print(f'{a.min() = }')

5. ```max()```:

In [None]:
print(f'{a.max() = }')

**Note:** By default, the above methods use all elements in the array. However, by specifying the axis parameter we can apply an operation along the specified axis. For example:

In [None]:
test_axis_arg = np.array([[1, 2], [3, 4]])
print(f'{ test_axis_arg.sum(axis=0) = }')  # sum of each column
print(f'{ test_axis_arg.sum(axis=1) = }')  # sum of each row


6. ```np.multiply(a1, a2)```:

In [None]:
print(f'{np.multiply(a, b) = }')
print(f'{a * b = }')

**Note:** The above is equivalent to ```a*b```.

7. ```np.divide(a1, a2)```:

In [None]:
print(f'{np.divide(b, a) = }')

**Note:** The above is equivalent to ```b/a```.

8. ```np.absolute(x)```:

Take the absolute value (element-wise).

In [None]:
my_array = np.array([-1, -2, -100, 5, 7, -23])
print(f'{np.absolute(my_array) = }')

9. ```np.log(x)```:

Apply the natural logarithm (element-wise).

In [None]:
print(f'{np.log(a) = }')

**Note:** You can find more details and mathematical functions (called universal functions) [here](https://numpy.org/doc/stable/reference/ufuncs.html).

### <div id="indexing">1.E Indexing, slicing and iterating</div>

One-dimensional arrays can be indexed, sliced and iterated over, as with any other Python sequence. Let's see a few examples:

In [None]:
a = np.arange(5, 15, 1)
print(f'{a = }')
print(f'{a[2] = }')

**Note:** As with Python lists, indexing starts at zero.

In [None]:
print(f'{a[1:4] = }')

Let's now iterate over all elements and print them:

In [None]:
for i in a:
    print(i)

Multidimensional arrays have one index per axis. To retrieve elements, we need to give the indices separated by commas. Let's see a few examples:

In [None]:
a = np.array([[1, 2, 3, 4], [2, 6, 8, 9], [3, 2, 1, 7]])
print(f'{a = }')

In [None]:
print(f'{a[1, 2] = }')

In [None]:
print(f'{a[0:2, 0] = }')

In [None]:
print(f'{a[:, 0] = }')

**Note:** If we provide fewer indices than the number of axes, the missing trailing axes are retrieved in full. Let's see an example:

In [None]:
print(f'{a[2] = }')

The above is equivalent to ```print(f'{a[2, :] = }')```.

We can also use ```...``` to request full indexing for those axes not specified. Let's imagine we have an array ```my_array``` containing 3 axes, then:

```my_array[..., 2]``` is equivalent to ```my_array[:, :, 2]```, and ```my_array[1, ...]``` is equivalent to ```my_array[1, :, :]```.
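
A quick way to check these equivalences is sketched below, using a hypothetical array ```cube``` with 3 axes (it is built with ```reshape()```, which simply arranges the 24 numbers into the shape ```(2, 3, 4)```):

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)   # an array with 3 axes

last_axis_pick = cube[..., 2]     # same as cube[:, :, 2]
first_axis_pick = cube[1, ...]    # same as cube[1, :, :]

print(f'{last_axis_pick.shape = }')
print(f'{first_axis_pick.shape = }')
print(np.array_equal(last_axis_pick, cube[:, :, 2]))
print(np.array_equal(first_axis_pick, cube[1, :, :]))
```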

Multidimensional arrays are iterated with respect to the first axis. For example:

In [None]:
for row in a:
    print(row)

Nevertheless, if we wish to iterate over each element in a multidimensional array, we can use the ```flat``` attribute:

In [None]:
for element in a.flat:
    print(element)

<u>**Boolean indexes:**</u>

We can use boolean indexes to explicitly choose which elements from an array we want and which ones we don't. For this, we need to create an array of booleans with the same shape as the array we want to take the data from. For example, let's create an array containing negative and positive values, then let's figure out which of those values are negative and which ones are not by obtaining a boolean array. Then, let's use that boolean array as boolean indexes to extract the negative values from the original array.

In [None]:
a = np.array([0, -1, 1, -2, 8, -3, -10, 32, 14, -18, 54, -79])
b = a < 0
print(f'{b = }')
negative_a = a[b]
print(f'{negative_a = }')

**Note:** We could simply have written ```negative_a = a[a < 0]``` instead.
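
Boolean arrays can also be combined before being used as indexes. The following sketch (with a separate array ```readings```, introduced only for this illustration) uses the element-wise operators ```&``` (and), ```|``` (or) and ```~``` (not):

```python
import numpy as np

readings = np.array([0, -1, 1, -2, 8, -3, -10, 32, 14, -18, 54, -79])

# Parentheses are required because & and | bind tighter than comparisons
positive_even = readings[(readings > 0) & (readings % 2 == 0)]
extremes = readings[(readings < -5) | (readings > 30)]
not_negative = readings[~(readings < 0)]

print(f'{positive_even = }')
print(f'{extremes = }')
print(f'{not_negative = }')
```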

### <div id="shaping">1.F Changing the shape of an array</div>

Let's see a few different ways of changing the shape of an array. Let's first create an array that we will use for all these examples.

In [None]:
a = np.array([[1, 2, 3, 4], [2, 6, 8, 9], [3, 2, 1, 7]])


1. ```ravel()``` and ```flatten()```:

Both return the array collapsed into one dimension.

```flatten()``` always returns a copy of the array, while ```ravel()``` returns a view of the original array whenever possible. Neither of them changes the original array when called. However, a view is a new array object that looks at the same data, so modifying the array returned by ```ravel()``` may also modify the elements in the original array. This will never happen with ```flatten()```.


In [None]:
print(f'{a.shape = }')
flat_a = a.flatten()
print(f'{flat_a = }')
print(f'{flat_a.shape = }')
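
The difference between a view and a copy can be checked with a small sketch like the following. It uses a separate array ```grid``` (a name chosen here for illustration) so that ```a``` stays untouched, and ```np.shares_memory()``` to test whether two arrays look at the same data:

```python
import numpy as np

grid = np.array([[1, 2], [3, 4]])
raveled = grid.ravel()       # a view here, because grid is contiguous in memory
flattened = grid.flatten()   # always a copy

raveled[0] = 99              # writes through to grid
flattened[1] = -1            # does not affect grid

print(f'{grid = }')
print(f'{np.shares_memory(grid, raveled) = }')
print(f'{np.shares_memory(grid, flattened) = }')
```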

2. ```reshape()```:

Use ```reshape()``` to give a new shape to an array without changing its data (a view of the original array is returned whenever possible). Examples:

First, let's reshape the ```a``` array into two axes, the first one having a length of 2, meaning the second axis must have a length of 6 (since ```a``` has 12 elements).

In [None]:
a_reshaped_1 = a.reshape(2, 6)
print(f'{a_reshaped_1 = }')
print(f'{a_reshaped_1.shape = }')

Let's now take the ```a``` array again and reshape it into the shape ```(6, 2)```:

In [None]:
a_reshaped_2 = a.reshape(6, 2)
print(f'{a_reshaped_2 = }')
print(f'{a_reshaped_2.shape = }')

**Note:**
- We can omit one of the sizes and use ```-1``` to automatically deduce the corresponding size. For the example above, ```a.reshape(6, -1)``` is equivalent to ```a.reshape(6, 2)```.
- We can change the shape of an array in-place with the ```ndarray.resize()``` method.
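
As a short sketch of the ```-1``` shortcut (using a hypothetical 12-element array ```table``` so that ```a``` is left unchanged):

```python
import numpy as np

table = np.arange(12).reshape(3, 4)

auto_cols = table.reshape(6, -1)   # second size deduced: 12 / 6 = 2
auto_rows = table.reshape(-1, 3)   # first size deduced: 12 / 3 = 4
one_axis = table.reshape(-1)       # a single axis with all 12 elements

print(f'{auto_cols.shape = }')
print(f'{auto_rows.shape = }')
print(f'{one_axis.shape = }')
```

Only one size can be left as ```-1```, since otherwise the shape would be ambiguous.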

3. Transpose an array:

Let's create a new array for this example:

In [None]:
a = np.array([[1, 2, 3], [5, 7, 9]])
print(f'{a = }')

And now, let's transpose it with the ```T``` attribute (a view of the original array is returned, not a copy):

In [None]:
print(f'{a.T = }')

### <div id="concatenating">1.G Concatenating, joining and stacking arrays</div>

Let's create two arrays to be used in examples:

In [None]:
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
print(f'{a = }')
print(f'{b = }')

Let's say we want to add the ```b``` array as an extra element to the array ```a``` along axis 0 (i.e. add it as another row). In that case we can do the following:

In [None]:
c = np.concatenate((a, b), axis=0)
print(f'{c = }')

**Note:** We can obtain the same result using the ```np.vstack()``` function:

In [None]:
c_prime = np.vstack((a, b))
print(f'{c_prime = }')

Let's say now we want to use the ```b``` array to extend each row in the array ```a``` (i.e. create a new column in the array ```a```). In that case we can do the following:

In [None]:
c = np.concatenate((a, b.T), axis=1)
print(f'{c = }')

**Note:**

- I needed to transpose the ```b``` array such that both arrays agree on the number of rows.
- We can obtain the same result using the ```np.hstack()``` function (```np.hstack((a, b.T))```).

For more information and examples, you can see [numpy.concatenate](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html).

### <div id="adding">1.H Adding, sorting and removing elements</div>

#### 1.H.a Adding

There are two functions to add elements to an array:

- ```np.insert(arr, obj, values, axis=None)```: Add elements at a specific index, where:

    - ```arr```: input array
    - ```obj```: object that defines the index or indices before which ```values``` is inserted.
    - ```values```: values to insert into ```arr```.
    - ```axis``` (optional): axis along which to insert ```values```. If axis is ```None``` then ```arr``` is flattened first.
    - the output is a copy of ```arr``` with ```values``` inserted (i.e. the insert does not occur in-place and a new array is returned).

Let's look at an example:

In [None]:
a = np.array([0, 2, 3, 4, 5, 6])
c = np.insert(a, 1, 1)
print(f'{c = }')

You can learn more about the ```insert()``` function in [numpy.insert](https://numpy.org/doc/stable/reference/generated/numpy.insert.html).

- ```np.append(arr, values, axis=None)```: Add values to the end of the array.

Let's use the ```c``` array again and add ```7``` at the end of the array:

In [None]:
c = np.append(c, 7)
print(f'{c = }')

You can learn more about the ```append()``` function in [numpy.append](https://numpy.org/doc/stable/reference/generated/numpy.append.html).
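
Both functions also work along a chosen axis of a multidimensional array. The following is a minimal sketch with a hypothetical 2x2 array ```matrix```:

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

flat_insert = np.insert(matrix, 1, 9)               # axis=None: matrix is flattened first
row_insert = np.insert(matrix, 1, [9, 9], axis=0)   # new row before row 1
col_insert = np.insert(matrix, 1, [9, 9], axis=1)   # new column before column 1
row_append = np.append(matrix, [[5, 6]], axis=0)    # new row at the end

print(f'{flat_insert = }')
print(f'{row_insert = }')
print(f'{col_insert = }')
print(f'{row_append = }')
```

In all cases a new array is returned and ```matrix``` itself is left unchanged.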

#### 1.H.b Sorting

We can use the ```sort(arr, axis=-1, ...)``` function to sort numbers in an array in ascending order, where:

- ```arr```: it is the array to be sorted.
- ```axis``` (optional): axis along which to sort (int or None). If ```None```, the array is flattened before sorting. The default is -1, which sorts along the last axis.

Let's create an array and sort it:

In [None]:
a = np.array([2, 0, 4, 6, 3, 9, 7])
a = np.sort(a)
print(f'{a = }')

**Note:** If we wish to sort in descending order, we can do the following instead: ```a = np.flip(np.sort(a))```. The ```np.flip()``` function reverses elements of an array along an axis. If you don’t specify the axis, it will reverse the contents along all axes.
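
For a multidimensional array, the ```axis``` argument decides what gets sorted. A small sketch, using a hypothetical 2x3 array ```scores```:

```python
import numpy as np

scores = np.array([[3, 7, 2], [9, 1, 8]])

by_row = np.sort(scores)                  # default axis=-1: each row is sorted
by_col = np.sort(scores, axis=0)          # each column is sorted
all_sorted = np.sort(scores, axis=None)   # flattened, then sorted
descending = np.flip(all_sorted)          # reversed: descending order

print(f'{by_row = }')
print(f'{by_col = }')
print(f'{all_sorted = }')
print(f'{descending = }')
```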

#### 1.H.c Deleting

We can use the ```np.delete(arr, obj, axis=None)``` function to delete elements from an array, where ```obj``` indicates the indices of the elements to remove along the chosen axis. Example:

In [None]:
a = np.array([2, 0, 4, 6, 3, 9, 7])
a = np.delete(a, [0, 1])
print(f'{a = }')

### <div id="random">1.I Random number generation</div>

The possibility of generating random numbers (actually repeatable pseudo-random numbers) is crucial for many applications, for example machine learning.

Before we can generate numbers, we need to get an instance of a Generator, which we can do with ```numpy.random.default_rng(seed=None)``` or alternatively in the following way:

In [None]:
from numpy.random import default_rng
gen = default_rng(1)

**Note:** A seed needs to be set to ensure reproducibility (if desired). If a seed is set, the same sequence of numbers will be generated every time an instance of the generator (using the same seed) is created.
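
This can be verified with a sketch like the following, where two separate generator instances are created with the same (arbitrarily chosen) seed and a third one with a different seed:

```python
import numpy as np
from numpy.random import default_rng

first_run = default_rng(1234).random(3)
second_run = default_rng(1234).random(3)   # same seed, new instance
other_seed = default_rng(4321).random(3)   # different seed

print(np.array_equal(first_run, second_run))   # same seed: same numbers
print(np.array_equal(first_run, other_seed))   # different seed: different numbers
```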

As an example, let's create an array with shape ```(2, 3)``` filled with floating random numbers:

In [None]:
a = gen.random((2, 3))
print(f'{a = }')

Let's generate now an array of shape ```(2, 3)``` filled with random integer numbers between 0 and 9:

In [None]:
a = gen.integers(10, size=(2, 3))
print(f'{a = }')

**Note:**

We can specify the start and end of the range used to generate integer values with ```integers()```:

```
random.Generator.integers(low, high=None, size=None, dtype=np.int64, endpoint=False)
```

where:

- ```low```: Lowest (signed) integers to be drawn from the distribution (unless high=None, in which case this parameter is 0 and this value is used for high).
- ```high```: If provided, one above the largest (signed) integer to be drawn from the distribution.

```integers()``` returns random integers from ```low``` (inclusive) to ```high``` (exclusive), or if ```endpoint=True```, ```low``` (inclusive) to ```high``` (inclusive).
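
As an illustration of ```low```, ```high``` and ```endpoint```, the sketch below simulates drawing numbers with and without the upper limit included (the generator ```dice_rng``` and its seed are arbitrary choices for this example):

```python
from numpy.random import default_rng

dice_rng = default_rng(7)

# high is exclusive by default: values between 1 and 5
half_open = dice_rng.integers(1, 6, size=1000)
# endpoint=True makes high inclusive: values between 1 and 6
closed = dice_rng.integers(1, 6, size=1000, endpoint=True)

print(f'{half_open.min() = }, {half_open.max() = }')
print(f'{closed.min() = }, {closed.max() = }')
```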

Among other things, we can generate numbers from a Normal distribution. For example, let's generate 10000 numbers from a Normal distribution with ```mu=5``` and ```sigma=1```:

In [None]:
dist = gen.normal(5, 1, 10000)
print(f'{dist = }') 

For more information about ```numpy.random``` see [here](https://numpy.org/doc/stable/reference/random/index.html).

## <div id="pandas">2. Pandas</div>

[Pandas](https://pandas.pydata.org/) is a library that enables the manipulation of data in a fast, powerful and easy way. Pandas has two main types of Data Structures*:

- ```Series```: 1D labeled array where elements must be of the same data type. Once created, its size cannot be changed.
- ```DataFrame```: 2D labeled table where elements can have different data type, and can be removed/added.

*Data Structures allow you to organize, process and store data.

To install pandas, see [getting_started](https://pandas.pydata.org/getting_started.html) (needs NumPy, among other libraries).

You can start using pandas in your code by simply adding:

In [None]:
import pandas as pd

If the above didn't work, please run the following on a terminal/console:
```
pip install pandas
```
or
```
pip3 install pandas
```

### <div id="series">2.A Series</div>

A Series is a one-dimensional labeled array whose size cannot be changed once it has been created.

Let's create a Series from an array of integer values which we will label:

In [None]:
my_series = pd.Series(data=[22, 1, 0], index=["age", "ndogs",  "ncats"])

The ```index``` parameter accepts a ```list``` that allows you to label the data.

The ```data``` argument can take any of the following data types:

- ```dict```
- ```list```
- ```ndarray```

Values inside the dictionary and elements in the array can be, for example, of ```int```, ```float```, ```bool``` or ```str``` type.

There is also a ```name``` parameter that allows you to name your Series.

If ```data``` is of ```dict``` type and ```index``` is not provided, the dictionary keys will be used as index labels.

Please see [pandas.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) for more information.
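
As a short sketch of the points above, the following creates a Series from a dictionary (so the keys become the index labels), gives it a name and retrieves one value by its label. The Series ```pets``` and the name ```'person_1'``` are made up for this illustration:

```python
import pandas as pd

pets = pd.Series({'age': 22, 'ndogs': 1, 'ncats': 0}, name='person_1')

print(pets)
print(f"{pets['ndogs'] = }")        # access a value by its label
print(f'{list(pets.index) = }')     # labels taken from the dictionary keys
print(f'{pets.name = }')
```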

### <div id="dataframes">2.B DataFrames</div>

With Pandas we can take data (from a CSV/TSV file, a SQL database, a dictionary, a NumPy array, etc) and create a Python object with rows and columns called ```DataFrame``` that resembles a table (like one would find in Microsoft Excel, for example).

There are many functions for opening different file formats. We will see only how to create a DataFrame out of a CSV file; you can find all other functions [here](https://pandas.pydata.org/docs/reference/io.html).

#### <div id="csv">2.B.a Create a DataFrame from a CSV file</div>

Let's create a DataFrame out of the ```cost_of_living.csv``` file (from [www.worlddata.info](https://www.worlddata.info/cost-of-living.php)) which has a list of countries, each with its ```cost_index```, ```monthly_income``` and ```purchasing_power_index```.

In [None]:
df = pd.read_csv('cost_of_living.csv')
df.head(4)

The first line, ```df = pd.read_csv('cost_of_living.csv')``` created the DataFrame, while the second one ```df.head(4)``` printed the first 4 rows. If one would like to print the last ```n``` rows, one would need to use ```df.tail(n)``` instead.

**Note:** each entry/row in our DataFrame has an index value. Indexing in DataFrames starts at zero. For example, in our DataFrame above, the row for Switzerland has the index 1.

Let's say we wanted only the ```country``` and ```monthly_income``` columns. In that case, we can use the ```usecols``` keyword to pass the list of columns we are interested in:

In [None]:
df_income = pd.read_csv('cost_of_living.csv', usecols = ['country', 'monthly_income'])
df_income.head(4)

**Note #1:** if the CSV file doesn't have column names, one can define them with the ```names``` keyword when using the  ```read_csv()``` function.

For more options, see [pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
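
As a minimal sketch of the ```names``` keyword, the following reads CSV content without a header line. To keep the example self-contained, the content comes from an in-memory ```io.StringIO``` object instead of a file, and the two rows are made-up values used only for illustration:

```python
import io
import pandas as pd

# Made-up CSV content without a header line (illustration only)
raw_csv = io.StringIO('Atlantis,1000\nElDorado,2000\n')

# names provides the column labels; the first line is then treated as data
no_header_df = pd.read_csv(raw_csv, names=['country', 'monthly_income'])
print(no_header_df)
```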

**Note #2:** If you already created a DataFrame and want to select a subset of columns, you can do something like the following:

```
new_df = df[['col1', 'col2']]
```

```new_df``` will have only ```col1``` and ```col2``` from the ```df``` DataFrame.

For example, let's get only the ```country``` column from ```df_income```:

In [None]:
countries_df = df_income[['country']]
countries_df.head(5)

Please note that when selecting a single column, one can get a ```DataFrame``` using ```df[['colname']]``` or a ```Series``` using ```df['colname']```. For the example above, let's get now a Series from the ```country``` column:

In [None]:
countries_series = df_income['country']

We can now confirm their types:

In [None]:
print(f'{type(countries_series) = }')
print(f'{type(countries_df) = }')

#### <div id="dict">2.B.b Create a DataFrame from a dictionary</div>

One can also create a DataFrame out of a dictionary. In order to convert a certain Python object (dictionary, NumPy array, etc) to a DataFrame, we need to use ```pd.DataFrame()```. For all the options, please see [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [None]:
my_dict = {
    'column1': [1, 2, 3, 4],
    'column2': [3, 4, 2, 1],
    'column3': [3, 5, 1, 6],
}
df = pd.DataFrame(my_dict)
df.head()

Alternatively, one could create a DataFrame from a dictionary of Series:

In [None]:
my_series_dict = {
    'column1': pd.Series([1, 2, 3, 4]),
    'column2': pd.Series([3, 4, 2, 1]),
    'column3': pd.Series([3, 5, 1, 6]),
}
df = pd.DataFrame(my_series_dict)
df.head()

One could also create a DataFrame using a list of dictionaries:

In [None]:
data = [
    {'column1': 1, 'column2': 3, 'column3': 3},
    {'column1': 2, 'column2': 4, 'column3': 5},
    {'column1': 3, 'column2': 2, 'column3': 1},
    {'column1': 4, 'column2': 1, 'column3': 6},
]
df = pd.DataFrame(data)
df.head()

#### <div id="array">2.B.c Create a DataFrame from a NumPy array</div>

In [None]:
array = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]])
df = pd.DataFrame(array, columns=['a', 'b', 'c', 'd'])
df.head()

Note: The ```columns``` keyword was used to label each column.

#### <div id="summarize">2.B.d Summarize data</div>

Let's read again the ```cost_of_living.csv``` file into a DataFrame and get some statistics/summaries:

In [None]:
df = pd.read_csv('cost_of_living.csv')
df.head()

**<u>Number of rows and columns</u>:**

Use ```df.shape``` to get the number of rows (107) and columns (4):

In [None]:
df.shape

If we only wish to know the number of rows, we can use ```len(df)```:

In [None]:
len(df)

**<u>List of column names</u>:**

Use ```df.columns``` to get the list of column names:

In [None]:
df.columns

**<u>Statistics for numerical columns</u>**

Another very useful command is ```df.describe()``` which provides summary statistics for numerical columns:

In [None]:
df.describe()

There are many other methods that can be used to get statistics from a DataFrame (or from a Series or a single column). Here are some of them:

- ```df.sum()```: returns the sum of each column
- ```df.mean()```: returns the mean of each column
- ```df.median()```: returns the median of each column
- ```df.std()```: returns the standard deviation of each column
- ```df.corr()```: returns the correlation between columns in a DataFrame
- ```df.count()```: returns the number of non-null values in each column
- ```df.min()```: returns the lowest value in each column
- ```df.max()```: returns the highest value in each column
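
Some of these methods (```mean()```, for instance) are only defined for numerical data, and recent pandas versions are expected to raise an error when the DataFrame also contains text columns. A minimal sketch of two ways around this is shown below, using a made-up DataFrame ```toy_df```: passing ```numeric_only=True```, or computing the statistic on a single numerical column.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'country': ['A', 'B', 'C'],         # made-up labels
    'income': [100.0, 200.0, 300.0],    # made-up values
})

numeric_means = toy_df.mean(numeric_only=True)   # skips the text column
income_median = toy_df['income'].median()        # statistic of a single column

print(numeric_means)
print(f'{income_median = }')
```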

You can find other functions in [api-dataframe-stats](https://pandas.pydata.org/pandas-docs/version/0.20.2/api.html#api-dataframe-stats).

**<u>Value counts</u>:**

The ```value_counts()``` function can be used to count the number of occurrences of each value. For example, let's open the ```users.csv``` file, which records the age and country of a few different 'users'. Let's use the ```value_counts()``` function to count the number of times each age appears:

In [None]:
users = pd.read_csv('users.csv')
users['age'].value_counts()

As you can see, the most common ages are 19 and 32, both appearing 3 times.

#### <div id="combine">2.B.e Combining and merging DataFrames</div>

**<u>Concatenating DataFrames</u>:**

The ```concat()``` function can be used to concatenate DataFrames. Use the following syntax to add rows from DataFrames ```df1``` and ```df2``` to a new DataFrame (the same columns should be available in both DataFrames):

```
pd.concat([df1, df2], ignore_index=True)
```

```ignore_index=True``` will reset the indexing to force unique indexes.

Example:

In [None]:
dict_1 = {'A': [1, 2], 'B': [1, 3]}
dict_2 = {'A': [5, 8], 'B': [2, 9]}
df1 = pd.DataFrame(dict_1)
df2 = pd.DataFrame(dict_2)
df = pd.concat([df1, df2], ignore_index=True)
df.head()

Use the following syntax to add columns from DataFrames to a new DataFrame (input DataFrames should have the same number of rows and indexing):

```
pd.concat([df1, df2], axis=1)
```

Example:

In [None]:
dict_1 = {'A': [1, 2], 'B': [1, 3]}
dict_2 = {'C': [5, 8], 'D': [2, 9]}
df1 = pd.DataFrame(dict_1)
df2 = pd.DataFrame(dict_2)
df = pd.concat([df1, df2], axis=1)
df.head()

**<u>Merging DataFrames:</u>**

Let's say we have two DataFrames with different (complementary) information about the same countries. Let's mimic that by creating another DataFrame with the ```cost_index``` and ```purchasing_power_index``` columns from the ```cost_of_living.csv``` file and merge it with ```df_income``` that we created before (which contains only the ```country``` and ```monthly_income``` columns). Here we will use the ```pd.merge()``` function.

In [None]:
df_rest = pd.read_csv('cost_of_living.csv', usecols = ['country', 'cost_index', 'purchasing_power_index'])
df_rest.head(4)

In [None]:
df_full = pd.merge(df_income, df_rest, how='left')
df_full.head(4)

Note: ```how='left'``` is used only to be able to compare with the ```join()``` function (see below). To learn about ```how``` and other parameters, see [pandas.DataFrame.merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html).
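
To get a feeling for what ```how``` does, here is a small sketch with two made-up DataFrames (```left_df``` and ```right_df```) that only partially share their countries:

```python
import pandas as pd

left_df = pd.DataFrame({'country': ['A', 'B', 'C'], 'income': [1, 2, 3]})
right_df = pd.DataFrame({'country': ['A', 'B', 'D'], 'cost': [10, 20, 40]})

# Both DataFrames are matched on their common column ('country')
left_merge = pd.merge(left_df, right_df, how='left')     # keeps A, B, C
inner_merge = pd.merge(left_df, right_df, how='inner')   # keeps A, B
outer_merge = pd.merge(left_df, right_df, how='outer')   # keeps A, B, C, D

print(left_merge)
print(inner_merge)
print(outer_merge)
```

Rows without a match on the other side get missing values in the columns coming from that side.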

As you can see, ```df_full``` has exactly the same information as the DataFrame that we created using all columns (with just a slightly different order for the columns). We can change the order of the columns in the following way:

In [None]:
df_full = df_full[['country', 'cost_index', 'monthly_income', 'purchasing_power_index']]
df_full.head(4)


Alternatively, we can use the ```join()``` function to add columns from two DataFrames sharing a column. ```join()``` can be used to join columns with another DataFrame either on index or on a key column.

In [None]:
df = df_income.join(df_rest.set_index('country'), on='country')
df.head()

As you can see, we obtained the same result as with the ```merge()``` function. For more details about the ```join()``` function, see [pandas.DataFrame.join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html).

#### <div id="write">2.B.f Save DataFrame to a file and convert to NumPy arrays</div>

DataFrames can be saved into different types of files (CSV, Excel, JSON, SQL table, etc). The corresponding methods are of the form ```df.to_filetype(filename)``` (for example, ```df.to_csv(filename)```). To find all of them, please see [pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

It is also possible to convert a DataFrame to a NumPy ```ndarray``` by using the ```to_numpy()``` function.
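
A minimal sketch of both operations is shown below. It uses a made-up DataFrame ```points_df``` and writes to a temporary directory (with the hypothetical file name ```points.csv```) so that no file is left behind:

```python
import os
import tempfile
import pandas as pd

points_df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'points.csv')   # hypothetical file name
    points_df.to_csv(csv_path, index=False)          # index=False: do not write the row labels
    reloaded_df = pd.read_csv(csv_path)              # read the file back

points_array = points_df.to_numpy()                  # convert to a NumPy ndarray

print(reloaded_df)
print(f'{points_array.shape = }')
```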

#### <div id="clean">2.B.g Handling missing data</div>

More often than not, some information is partially missing in our input data. Let's create a DataFrame from a dictionary that has some missing values:

In [None]:
data_dict = {
    'name': ['John', 'Maria', 'Jennifer', 'Matthias', 'Matt'],
    'age': [23, 32, 45, 52, 39],
    'nationality': ['England', 'Argentina', 'USA', None, 'Canada'],
    'civil_status': ['married', None, 'single', 'divorced', 'single'],
    'stars': [2, 3, 4, 5, None],
}
data_rankings = pd.DataFrame(data_dict)
data_rankings.head()

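As you can see, missing/null information is denoted by ```NaN``` (or by ```None``` in columns holding Python objects such as strings, depending on the pandas version).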


We can check for missing values with the ```isnull()``` function. This will return ```True``` for a missing value and ```False``` for non-missing values, as you can see below:

In [None]:
data_rankings.isnull()

We can count how many missing values per column are present in the DataFrame in the following way:

In [None]:
data_rankings.isnull().sum()

Note: Alternatively, if we want to identify all non-missing values, we can use ```notnull()``` instead.

Now, you can decide to remove rows with a missing value using ```dropna()```, remove columns with missing values with ```dropna(axis=1)``` or fill the missing values with other values using ```fillna(x)``` which fills the missing values with ```x```. ```x``` can be anything, for example it could be ```df.mean()``` (you can use pretty much any stat function).
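
These options can be compared on a tiny made-up DataFrame (```gaps_df```, introduced only for this sketch) with a single missing value:

```python
import numpy as np
import pandas as pd

gaps_df = pd.DataFrame({
    'full': [1.0, 2.0, 3.0],
    'holes': [1.0, np.nan, 3.0],
})

rows_dropped = gaps_df.dropna()            # default axis=0: drop rows with missing values
cols_dropped = gaps_df.dropna(axis=1)      # drop columns with missing values
filled = gaps_df.fillna(gaps_df.mean())    # fill each column with its own mean

print(rows_dropped)
print(cols_dropped)
print(filled)
```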

For our example above, let's replace missing information for the columns ```'age'``` and ```'stars'``` by their averages. Then, let's remove rows for which there is still missing data (effectively removing rows with missing data in columns ```'name'```, ```'nationality'``` and ```'civil_status'```.)

Let's first replace all missing values for columns ```'age'``` and ```'stars'``` by their averages:

In [None]:
# First create a dictionary collecting the average for each desired column
values = {key: data_rankings[key].mean().astype(int) for key in ['age', 'stars']}

# Let's then fill missing values
df = data_rankings.fillna(value=values)
df.head()

Let's now remove any row still having missing info:

In [None]:
df = df.dropna()
df.head()

Note: To drop rows with missing info only in specific columns, we can use the ```subset``` argument, for example:

```
df = df.dropna(subset=['name', 'nationality'])
```
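
To make the difference between these options concrete, here is a small sketch comparing them side by side. The toy DataFrame and the variable names are assumptions made only for this illustration.

```python
import pandas as pd

# Hypothetical toy data with missing values
toy_df = pd.DataFrame({
    'name': ['John', 'Maria', 'Matt'],
    'nationality': ['England', None, 'Canada'],
    'stars': [2.0, 3.0, None],
})

# axis=0 (the default): drop every row that has at least one missing value
rows_dropped = toy_df.dropna()

# axis=1: drop every column that has at least one missing value
cols_dropped = toy_df.dropna(axis=1)

# subset: only look for missing values in the listed columns
subset_dropped = toy_df.dropna(subset=['nationality'])

# fillna with a dictionary: fill only the 'stars' column, here with its mean
filled = toy_df.fillna({'stars': toy_df['stars'].mean()})

print(rows_dropped)
print(cols_dropped)
print(subset_dropped)
print(filled)
```

None of these calls modifies ```toy_df```; each one returns a new DataFrame.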

#### <div id="transform">2.B.h Transform and replace data</div>

**<u>The ```replace()``` function</u>:**

If you need to replace certain values with a fixed set of new values, you can use the ```replace()``` function. Here are two examples:

i. Let's replace all zeros with ones:

In [None]:
data_dict = {
    'name': ['John', 'Maria', 'Jennifer', 'Matthias', 'Matt'],
    'stars': [0, 2, 4, 3, 5],
}
df = pd.DataFrame(data_dict)
df.head()

In [None]:
df.replace(0, 1, inplace=True)
df.head()

Here I used ```inplace=True``` to replace values directly on the DataFrame, instead of having to make a new copy of the DataFrame.

ii. Let's now replace several values at once:

In [None]:
data_dict = {
    'name': ['John', 'Maria', 'Jennifer', 'Matthias', 'Matt'],
    'stars': ['one', 'two', 'three', 'four', 'five'],
}
df = pd.DataFrame(data_dict)
df.head()

In [None]:
df.replace(['one', 'two', 'three', 'four', 'five'], [1, 2, 3, 4, 5], inplace=True)
df.head()

Tip: if you need to rename columns, you could do the following:

```
df.rename(columns={'old_name_1': 'new_name_1', 'old_name_2': 'new_name_2'}, inplace=True)
```

**<u>Replace values depending on a condition with the ```where()``` function from NumPy</u>:**

The syntax is the following:

```
np.where(condition, value_if_true, value_if_false)
```

Let's once again use the ```data_rankings``` DataFrame and convert the age values to the following two ranges: ```'0-45'``` and ```'45+'```:

In [None]:
import copy
df = copy.deepcopy(data_rankings)
df['age'] = np.where(data_rankings['age'] > 45, '45+', '0-45')
df.head()

As you can see, Matthias was assigned to the ```'45+'``` group since he is 52 years old, while the rest were assigned to the ```'0-45'``` group.

Note: I made a copy of the ```data_rankings``` DataFrame to preserve it.

**<u>Replace values using the ```applymap()``` function</u>:**

Sometimes, we need to apply a custom function to each value/element in a DataFrame. For such cases, the ```applymap()``` function can be used. It works similarly to the Python built-in ```map()``` function, i.e. a function is applied to all elements in the DataFrame. Note that ```applymap()``` is deprecated since pandas 2.1 in favor of ```DataFrame.map()```, which is used in the same way. Let's take a look at the following DataFrame:

In [None]:
data_dict = {
    'distance_1 (m)': [10325, 45212, 829252, 87328],
    'distance_2 (m)': [76290, 621852, 42196, 980171],
}
df = pd.DataFrame(data_dict)
df.head()

Imagine that all values are in meters but we wish to convert them to kilometers. In that case, we could define a function (```m_to_km()```) to convert meters to kilometers, pass it to the ```applymap()``` function, and then rename the columns:

In [None]:
def m_to_km(value):
    return value * 0.001

df = df.applymap(m_to_km)
df.rename(columns={'distance_1 (m)': 'distance_1 (km)', 'distance_2 (m)': 'distance_2 (km)'}, inplace=True)
df.head()
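
For a simple arithmetic conversion like this one, an element-wise function is not strictly needed, since arithmetic operators already act on every element of a DataFrame. The sketch below shows this alternative (it is a different technique from ```applymap()```, not a replacement for it in general); the variable names are chosen only for this illustration.

```python
import pandas as pd

distances_m = pd.DataFrame({
    'distance_1 (m)': [10325, 45212, 829252, 87328],
    'distance_2 (m)': [76290, 621852, 42196, 980171],
})

# Dividing the DataFrame by a number divides every element
distances_km = distances_m / 1000

# Rename all columns at once by replacing the unit in each column name
distances_km.columns = [name.replace('(m)', '(km)') for name in distances_km.columns]

print(distances_km)
```

An element-wise function remains useful when the transformation cannot be written as a simple expression, for example when it involves custom string handling.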

#### <div id="remove">2.B.i Remove unwanted rows/columns</div>

You can use the ```drop()``` function to remove unwanted columns or rows from a DataFrame:

```df.drop(to_drop, inplace=True, axis=1)```

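Let's reuse the ```data_rankings``` DataFrame, remove the row with index label ```2``` (i.e. the third row) and reset the index:

In [None]:
df = data_rankings.copy()  # work on a copy to preserve data_rankings
df.drop(2, inplace=True)  # by default (axis=0), drop() removes rows by index label
df.reset_index(drop=True, inplace=True)  # renumber the rows so there are no gaps; drop=True discards the old index
df.head()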

**Note:** We could have done the following as well: ```df.drop(df.index[2], inplace=True)```.

Let's now remove the ```'civil_status'``` column from the ```data_rankings``` DataFrame:

In [None]:
df = copy.deepcopy(data_rankings)
df.drop(['civil_status'], inplace=True, axis=1)
df.head()

Alternatively, we could do the following as well: ```df.drop(columns=['civil_status'], inplace=True)``` or make a copy of the DataFrame selecting only the desired columns: ```new_df = df[['name', 'age', 'nationality', 'stars']]```.

#### <div id="query">2.B.j Examine the data and make selections and queries</div>

**<u>Sort data</u>:**

Rows can be ordered based on the contents of one or more columns.

- ```df.sort_values(column_1)``` sorts the ```df``` DataFrame in ascending order based on the values in the ```column_1``` column.
- ```df.sort_values(column_1, ascending=False)``` sorts the ```df``` DataFrame in descending order based on the values in the ```column_1``` column.
- ```df.sort_values([column_1, column_2], ascending=[True, False])``` sorts the values by ```column_1``` in ascending order, then sorts the values by ```column_2``` in descending order.
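
The three variants listed above can be sketched as follows. The toy DataFrame and the variable names are made up for this illustration.

```python
import pandas as pd

# Hypothetical toy data; two people share the same age
toy_df = pd.DataFrame({
    'name': ['John', 'Maria', 'Jennifer', 'Matt'],
    'age': [23, 32, 45, 32],
    'stars': [2, 3, 4, 5],
})

# Ascending order by a single column
by_age = toy_df.sort_values('age')

# Descending order by a single column
by_age_desc = toy_df.sort_values('age', ascending=False)

# Ascending by 'age'; rows with the same age are ordered by 'stars', descending
by_age_then_stars = toy_df.sort_values(['age', 'stars'], ascending=[True, False])

print(by_age_then_stars)
```

Note that ```sort_values()``` returns a new DataFrame and keeps the original index labels; the original DataFrame is left unchanged.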

**<u>Select rows by index or label</u>:**

There are two ways of accessing the data from specific rows: label-based and position-based indexing. In the former, we pick rows matching given labels in the index axis of the DataFrame using ```loc[]```, while in the latter we access rows based on their position in the DataFrame using ```iloc[]```. Let's take a look at the following DataFrame:

In [None]:
df = pd.read_csv('cost_of_living.csv')
countries_indexing = df.set_index('country')  # this is to use the country column as indexes
countries_indexing = countries_indexing.rename_axis(None)  # not necessary, it is just easier to highlight that the first column is the index axis (by removing its name)
countries_indexing.head()

I again created a DataFrame from the ```'cost_of_living.csv'``` file and, instead of keeping the numerical index values, used the ```'country'``` column as the index (since it has unique values). Now, we can use ```loc[]``` to obtain a DataFrame only for Switzerland and Iceland:

In [None]:
countries_indexing.loc[['Switzerland', 'Iceland']]

Now, we can get the same rows by using their numerical positions (1 and 4, respectively) using ```iloc[[1, 4]]```:

In [None]:
countries_indexing.iloc[[1, 4]]

Please note that we could also have done the following to get the same rows (since the labels in ```df``` match their positions):

In [None]:
df.loc[[1, 4]]

**<u>Selecting rows satisfying a condition</u>:**

In DataFrames, we can easily select data satisfying certain requirements. Let's reuse the ```data_rankings``` DataFrame and get a new DataFrame listing people that have 3 or more stars:

In [None]:
df = data_rankings[data_rankings['stars'] > 2]
df.head()

**Note:** By doing ```data_rankings[condition]```, I obtain the subset of the ```data_rankings``` DataFrame that satisfies the ```condition```. In this case, the condition is ```data_rankings['stars'] > 2```, which only selects rows in which the value for the column ```stars``` is greater than 2. This is known as filtering. Conditions can be combined using ```&``` (AND) or ```|``` (OR), for example: ```(data_rankings['stars'] > 2) & (data_rankings['age'] > 40)```. In this case, I ask for stars to be greater than 2 and age to be greater than 40. Each individual condition must be wrapped in parentheses.

Let's now get the age of all the people that have 3 or more stars:
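
Combined conditions can be sketched as follows. To keep the example self-contained, I rebuild a reduced version of the data in a separate DataFrame (```people_df```, a name chosen only for this illustration), so no other variable in this notebook is modified.

```python
import pandas as pd

# Reduced copy of the data used above (hypothetical variable name)
people_df = pd.DataFrame({
    'name': ['John', 'Maria', 'Jennifer', 'Matthias', 'Matt'],
    'age': [23, 32, 45, 52, 39],
    'stars': [2, 3, 4, 5, None],
})

# AND: both conditions must hold
both = people_df[(people_df['stars'] > 2) & (people_df['age'] > 40)]

# OR: at least one condition must hold
either = people_df[(people_df['stars'] > 2) | (people_df['age'] > 40)]

# NOT: rows for which the condition is False
negated = people_df[~(people_df['stars'] > 2)]

print(both['name'].tolist())
print(either['name'].tolist())
print(negated['name'].tolist())
```

Comparisons involving a missing value evaluate to ```False```, so Matt (whose number of stars is missing) is not selected by ```stars > 2``` but does appear in the negated selection.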

In [None]:
people = df['age'].to_numpy()
print(f'Age of people with stars > 2: {people}')

Note: I have used the ```to_numpy()``` function to get an ```ndarray```.

Another way to select rows satisfying a given condition is to use the ```query(condition)``` function:

In [None]:
people = data_rankings.query('stars > 2')['age'].to_numpy()
print(f'Age of people with stars > 2: {people}')
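
The ```query()``` function also accepts combined conditions (with ```and``` / ```or```) and can refer to Python variables by prefixing them with ```@```. Here is a minimal sketch on a reduced copy of the data; the names ```people_df``` and ```min_stars``` are assumptions made for this illustration.

```python
import pandas as pd

# Reduced copy of the data used above (hypothetical variable name)
people_df = pd.DataFrame({
    'name': ['John', 'Maria', 'Jennifer', 'Matthias', 'Matt'],
    'age': [23, 32, 45, 52, 39],
    'stars': [2, 3, 4, 5, None],
})

min_stars = 2  # a regular Python variable, referenced as @min_stars in the query

selected = people_df.query('stars > @min_stars and age > 40')
selected_ages = selected['age'].to_numpy()

print(selected['name'].tolist())
print(selected_ages)
```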

**<u>Grouping data</u>:**

What if we wish to extract some information after grouping our data into categories? Let's go through two examples using the ```Walmart.csv``` file from [www.kaggle.com](https://www.kaggle.com/datasets/naveenkumar20bps1137/walmart-sales-analysis), which lists some of their received orders:

In [None]:
orders = pd.read_csv('Walmart.csv')
orders.head()


The DataFrame contains information such as the category assigned to each order as well as how much was spent on each order (```Sales```).

**i. Obtain the total of ```Sales``` for each category:**

For this, we need to use the ```groupby()``` function, then take the ```Sales``` column and sum all its values for each category, and finally sort them to show the highest values first. We can do all this with a single line, as shown below:

In [None]:
data = orders.groupby('Category')['Sales'].sum().to_frame().sort_values(by='Sales', ascending=False)
data.head()

**Note:**
- ```orders.groupby('Category')``` returns a groupby object, not a DataFrame.
- We get a Series after using ```sum()```, which is converted to a DataFrame with the ```to_frame()``` function.

It is also possible to group based on values from multiple columns. For example, let's now get the total sales for each category for every city:

In [None]:
data = orders.groupby(['Category', 'City'])['Sales'].sum().to_frame().sort_values(by='Sales', ascending=False)
data.head()

From this we know that the city with the best selling category is Los Angeles, where phones worth 29503.04 were sold.

**ii. Obtain further sales statistics on every category:**

Let's now get the number of orders, the average sales and the total sales for every category. We will again use the ```groupby()``` function, but this time together with the ```agg()``` function, to aggregate statistics for every category:

In [None]:
# Function names are passed as strings; recent pandas versions warn when NumPy functions such as np.mean are passed instead
data = orders.groupby('Category').agg({'Sales': ['size', 'mean', 'sum']}).sort_values(by=('Sales', 'sum'), ascending=False)
data.head()


**Note:**
- ```agg({'Sales': ['size', 'mean', 'sum']})``` aggregates three operations (number of rows, mean and sum) over the ```Sales``` column.
- ```sort_values(by=('Sales', 'sum'), ascending=False)``` is used to order rows based on the total sales, showing the highest values first (i.e. descending order).

**Tip:** If you need to retrieve any of the aggregated columns, let's say for example the ```size``` column, then do the following: ```data['Sales']['size']``` or, equivalently, ```data[('Sales', 'size')]```.
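
If the two-level column names are inconvenient, one option is the named aggregation syntax of ```agg()```, where each output column is given as ```new_name=(column, function)```. The sketch below uses a small made-up DataFrame (not the Walmart data), and the output column names are chosen only for this illustration.

```python
import pandas as pd

# Hypothetical toy orders, used only for this illustration
toy_orders = pd.DataFrame({
    'Category': ['Phones', 'Chairs', 'Phones', 'Chairs', 'Paper'],
    'Sales': [100.0, 50.0, 300.0, 150.0, 20.0],
})

summary = (
    toy_orders.groupby('Category')
    .agg(
        n_orders=('Sales', 'size'),     # number of rows per category
        mean_sales=('Sales', 'mean'),   # average sales per category
        total_sales=('Sales', 'sum'),   # total sales per category
    )
    .sort_values(by='total_sales', ascending=False)
)

print(summary)
```

With this syntax the result has ordinary, single-level column names, so a column can be retrieved directly with ```summary['n_orders']```.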

#### <div id="add">2.B.k Add new columns/rows</div>

**<u>Add a new column</u>:**

First, let's create a new DataFrame:

In [None]:
data_dict = {
    'A': [0, 1, 2, 3, 4],
    'B': [1, 2, 3, 4, 5],
    'C': [3, 1, 5, 6, 8],
    'D': [9, 7, 1, 3, 5],
}
df = pd.DataFrame(data_dict)
df.head()

Let's now create a new ```'E'``` column, which will be the sum of all other columns. We will see two ways of doing this:

i. Sum of Series

In [None]:
df['E'] = df['A'] + df['B'] + df['C'] + df['D']
df.head()

ii. Using the ```apply()``` function:

In [None]:
def my_sum(values):
    return values.sum()

df = pd.DataFrame(data_dict)
df['E'] = df.apply(my_sum, axis=1)
df.head()

Note: we could also have used a lambda function with the ```apply()``` function:

In [None]:
df = pd.DataFrame(data_dict)
df['E'] = df.apply(lambda x: x.sum(), axis=1)
df.head()

Let's now see another example where we add a new column as a result of a condition applied to a given column. In this case, we add a new column ```E``` that is ```True``` if ```A``` is greater than 2 and ```False``` otherwise:

In [None]:
df = pd.DataFrame(data_dict)
df['E'] = df['A'] > 2
df.head()

Alternatively, we could have used the ```where()``` function from NumPy:

In [None]:
df = pd.DataFrame(data_dict)
df['E'] = np.where(df['A'] > 2, True, False)
df.head()

Let's see another example in which we add a new column to a DataFrame, but now using the ```insert()``` function, which adds a column at a given position along the column axis.

We will use again our ```data_rankings``` DataFrame:

In [None]:
data_rankings.head()

And we will group people into age groups. We will use the ```cut()``` function which can be used to segment data into bins (i.e. place data into discrete intervals, known as bins):

In [None]:
bins = [0, 15, 25, 35, 45, 65, 100]
data_rankings.insert(2, 'age-group', pd.cut(data_rankings['age'], bins))
data_rankings.head()
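
By default, ```cut()``` labels each bin with its interval, and the intervals are closed on the right, e.g. ```(35, 45]``` contains 45 but not 35. If more readable names are preferred, the ```labels``` argument can be used. Here is a small sketch; the label strings and variable names are my own choice for this illustration.

```python
import pandas as pd

ages = pd.Series([23, 32, 45, 52, 39])
bins = [0, 15, 25, 35, 45, 65, 100]

# One label per bin, i.e. one fewer than the number of bin edges
labels = ['0-15', '16-25', '26-35', '36-45', '46-65', '66-100']

age_groups = pd.cut(ages, bins, labels=labels)

print(age_groups)
```

Since the bins are closed on the right, the age 45 falls into the ```'36-45'``` group rather than into ```'46-65'```.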

**<u>Add a new row</u>:**

Let's create again a new DataFrame:

In [None]:
df = pd.DataFrame(data_dict)
df.head()

Let's now add a new row using ```loc[]``` that contains, for each column, the sum of all the other rows:

In [None]:
df.loc[5] = df.apply(lambda x:x.sum(), axis=0)
df.head(6)

**Note:**
- I could have done the following as well: ```df.loc[5] = df.loc[0] + df.loc[1] + df.loc[2] + df.loc[3] + df.loc[4]```
- ```df.iloc[5]``` wouldn't work here since there is no row at position 5; ```iloc[]``` can only be used to replace/update an existing row.
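
The difference between the two indexers when adding a row can be sketched as follows, on a small made-up DataFrame (the names are chosen only for this illustration):

```python
import pandas as pd

# Hypothetical toy data with three rows (labels 0, 1 and 2)
toy_df = pd.DataFrame({'A': [0, 1, 2], 'B': [1, 2, 3]})

# Sum of every column, computed before the new row is added
column_totals = toy_df.sum()

# Label 3 does not exist yet, so loc[] appends a new row
toy_df.loc[3] = column_totals

# Position 4 does not exist, and iloc[] cannot add rows
try:
    toy_df.iloc[4] = column_totals
    iloc_error = None
except IndexError as error:
    iloc_error = error

print(toy_df)
print(f'iloc raised: {iloc_error}')
```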

#### <div id="plotting">2.B.l Plotting</div>

In this section, we will create some graphical representations:

**<u>Vertical bar plot</u>:**

Let's reuse our ```data_rankings``` DataFrame and make a vertical bar plot showing the number of people in each age group:

In [None]:
data_rankings['age-group'].value_counts().plot.bar()

**Note:** ```data_rankings['age-group'].value_counts().plot(kind='bar')``` would also work.

**<u>Scatter plot</u>:**

Let's go back to the DataFrame created from the ```cost_of_living.csv``` file. Let's say we wish to determine if there is any trend or correlation between ```purchasing_power_index``` and ```monthly_income```. In that case, we could look at a scatter plot like the following one:

In [None]:
df = pd.read_csv('cost_of_living.csv')
df.plot.scatter(x='monthly_income', y='purchasing_power_index')

**Note:** ```df.plot(x='monthly_income', y='purchasing_power_index', kind='scatter')``` would also do the trick.

**<u>Histogram</u>:**

Let's now make a histogram using the ```monthly_income``` column:

In [None]:
df['monthly_income'].plot(kind='hist')

For other plotting options, see [pandas.DataFrame.plot](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html).

#### <div id="other">2.B.m Other helpful attributes and functions</div>

**<u>```is_unique``` and ```set_index()```:</u>**

Sometimes, we expect a certain column to have unique values. We can confirm if that is correct in our DataFrame using the ```is_unique``` attribute.

Let's take a look at an example where a DataFrame has an ```'id'``` column with non-unique values and let's look at the result given by the ```is_unique``` attribute:

In [None]:
df = pd.DataFrame(np.array([[0], [1], [2], [2], [4], [5]]), columns=['id'])
df['id'].is_unique

As expected, ```is_unique``` returned ```False``` since ```2``` is repeated.

Let's now look at an example where all values are unique:

In [None]:
df = pd.DataFrame(np.array([[0], [1], [2], [3], [4], [5]]), columns=['id'])
df['id'].is_unique

Now as expected, ```is_unique``` returned ```True```. If we wish, since the ```'id'``` column has unique values, we could set it as index in our DataFrame with the ```set_index()``` function:

In [None]:
df.set_index('id', inplace=True)
df.head()
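
When ```is_unique``` returns ```False```, it is often useful to see which rows are responsible. The sketch below uses the ```duplicated()``` function for that, and also shows that ```set_index()``` can be asked to check for duplicates itself through its ```verify_integrity``` argument. The toy DataFrame and variable names are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical toy data where the id 2 appears twice
ids_df = pd.DataFrame({
    'id': [0, 1, 2, 2, 4, 5],
    'value': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# keep=False marks every occurrence of a repeated value, not only the later ones
duplicates = ids_df[ids_df['id'].duplicated(keep=False)]

# verify_integrity=True makes set_index() raise an error if the new index has duplicates
try:
    ids_df.set_index('id', verify_integrity=True)
    index_error = None
except ValueError as error:
    index_error = error

print(duplicates)
print(f'set_index raised: {index_error}')
```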

**<u>Pivoting your DataFrame</u>:**

Let's create a tiny DataFrame with prices from four different stores for two products (a TV and a smartphone):

In [None]:
products = pd.DataFrame({
    'product': ['TV', 'TV', 'TV', 'TV', 'Smartphone', 'Smartphone', 'Smartphone', 'Smartphone'],
    'store': ['Fnac', 'Fust', 'Interdiscount', 'Amazon', 'Fust', 'Fnac', 'Amazon', 'Interdiscount'],
    'price':[499.99, 498.05, 460.20, 451.90, 259.99, 260.05, 250.20, 258.39],
})
products.head(8)

Let's say we wish to compare prices for the same product across different stores more easily. In that case, we could do something like the following:

In [None]:
pivot_products = products.pivot(index='product', columns='store')
print(pivot_products)
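
When only one column holds the values of interest, it can be passed through the ```values``` argument of ```pivot()```, which gives a table with a single level of column names. The sketch below uses a subset of the prices above in a separate DataFrame; the variable names are chosen only for this illustration.

```python
import pandas as pd

# Subset of the prices used above (hypothetical variable name)
toy_products = pd.DataFrame({
    'product': ['TV', 'TV', 'Smartphone', 'Smartphone'],
    'store': ['Fnac', 'Amazon', 'Fnac', 'Amazon'],
    'price': [499.99, 451.90, 260.05, 250.20],
})

# One row per product, one column per store, prices as cell values
price_table = toy_products.pivot(index='product', columns='store', values='price')

# For each product (row), the store (column label) with the lowest price
cheapest_store = price_table.idxmin(axis=1)

print(price_table)
print(cheapest_store)
```

Note that ```pivot()``` requires each (index, columns) pair to appear only once; otherwise it raises an error.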

That's all for now. If you wish to learn about data visualization in Python, take a look at the next module / notebook (#5).