This notebook introduces the Pandas library and provides some practice with it. 

# Performance:

## Speed of Calculation: 

***
***


Why do we use Numpy in Data Science?  
Because it is fast. How fast?  

Let's find out. I am going to test the speed of a sum over 1 million entries using the built-in Python `sum()` function and the Numpy `np.sum()` function. The `%timeit` magic command then reports how fast each calculation was. 

In [2]:
import numpy as np

array_one_million = np.random.rand(1000000)  

%timeit sum(array_one_million)  # using the default Python sum() method
%timeit np.sum(array_one_million)  # using the Numpy sum method

93.7 ms ± 371 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
975 µs ± 180 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
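
`%timeit` is an IPython magic, so it only runs inside Jupyter or IPython. As a rough sketch, the same comparison could be reproduced in a plain Python script with the standard `timeit` module. The array size, the number of repetitions and the names `small_array`, `python_time` and `numpy_time` are assumptions made for this illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

# Smaller array than in the cell above so the sketch runs quickly (assumed size).
small_array = np.random.rand(100_000)

# timeit.timeit() accepts a callable and returns the total time for `number` calls.
python_time = timeit.timeit(lambda: sum(small_array), number=5)
numpy_time = timeit.timeit(lambda: np.sum(small_array), number=5)

print(f"built-in sum(): {python_time / 5:.6f} s per call")
print(f"np.sum():       {numpy_time / 5:.6f} s per call")

# Both approaches compute the same total, up to floating-point rounding.
print(np.isclose(sum(small_array), np.sum(small_array)))
```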


## Pandas  

***


Pandas is actually built on Numpy, so the performance of Numpy is brought to Pandas.  
Pandas offers many more tools to work on data. We'll see many of them. 

In [3]:
import pandas as pd  
print(pd.__version__)

2.1.4


I just imported the Pandas module and checked which version is available here.  

Now let's use some of its basic methods: 

In [9]:
panda_array = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])
print(panda_array.values)
print(panda_array.index)

[2 3 4 5]
Index(['a', 'b', 'c', 'd'], dtype='object')


We just created a simple 1-dimensional array by calling `pd.Series()` and declaring an index list for it.  

>NOTE:  
As usual the syntax matters; **pay attention to the capital "S" of `.Series`**. I made the mistake of writing it with a lowercase "s".  

The `.values` attribute returns what it says, the values of the array. And the `.index` attribute returns the indexes we declared, wrapped in an `Index` object. We call them **"Explicit Indexes"** as opposed to the default implicit ones (`[0, 1, 2, 3, ...]`).  
The Series created here has a structure similar to the pairing that dictionaries are based upon: keys and values. 
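
To make the dictionary comparison concrete, here is a minimal sketch that reuses the `panda_array` Series from the cell above and treats it like a dictionary. It only relies on label lookup, the `in` operator and `Series.to_dict()`.

```python
import pandas as pd

panda_array = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])

# Look up a value by its label, like dict[key].
print(panda_array['b'])        # 3

# Membership tests check the index labels, not the values.
print('a' in panda_array)      # True
print(2 in panda_array)        # False

# Convert the Series to a regular dictionary.
print(panda_array.to_dict())   # {'a': 2, 'b': 3, 'c': 4, 'd': 5}
```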

With a simple example like this one we can already check how Pandas works. Let's see the type of the array returned by `.values` and the type of the variable holding that array, `panda_array`:

In [7]:
print(type(panda_array.values))  
print(type(panda_array))

<class 'numpy.ndarray'>
<class 'pandas.core.series.Series'>


We have an **ndarray**, straight from the numpy logic, and that is held in a **Pandas Series**. 

#### Dictionary first, then Series:

We do not have to declare a Series like we did above here.  
It is also possible to create a dictionary first and then pass it to `pd.Series()`, like so:

In [12]:
dict_example = {'A':4, 
               'A-':3.5,
               'B':3,
               'B-':2.5,
               'B':2}  # duplicate key: this value overwrites 'B':3 above, so only 4 entries remain
panda_series_1 = pd.Series(dict_example)
panda_series_1

A     4.0
A-    3.5
B     2.0
B-    2.5
dtype: float64
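
The output has four rows although five pairs were typed. The sketch below suggests why: the key `'B'` appears twice in the literal, and a Python dictionary keeps only the last value for a repeated key, before Pandas ever sees it. It also shows why `4` is displayed as `4.0`.

```python
import pandas as pd

# Same literal as in the cell above: the key 'B' appears twice.
dict_example = {'A': 4, 'A-': 3.5, 'B': 3, 'B-': 2.5, 'B': 2}

# Python keeps the first position of 'B' but the last value assigned to it.
print(dict_example)           # {'A': 4, 'A-': 3.5, 'B': 2, 'B-': 2.5}
print(len(dict_example))      # 4

panda_series_1 = pd.Series(dict_example)

# Mixing integers and floats gives a float64 Series, so 4 is stored as 4.0.
print(panda_series_1.dtype)   # float64
print(panda_series_1['B'])    # 2.0
```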

#### Slight Slice Difference:

A Series is built from lists but is structured like a dictionary, and its index labels are ordered. That means we can slice by index label.  
Pandas behaves a little differently here: slicing by label does not exclude the upper limit that is declared. Here's an example:  

In [11]:
panda_array['a':'c']

a    2
b    3
c    4
dtype: int64

Plain Python and numpy slicing works by position and excludes the stop value: `[0:2]` returns only positions 0 and 1.  
Pandas label slicing such as `['a':'c']` includes the stop label, so the value of 'c' is also returned. 

> NOTE:  
This inclusive behavior applies ONLY to the declared index labels. Pandas still keeps track of the default position of each value: `[0, 1, 2, 3, ...]`. Slicing with those implicit positions follows the default Python behavior (and so excludes the last position's value). 
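
A small sketch of that difference, using the same `panda_array` Series. It uses `.loc` and `.iloc` to spell out which kind of slice is meant; the names `by_label` and `by_position` are only for illustration.

```python
import pandas as pd

panda_array = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])

# Label-based slice: the stop label 'c' is included.
by_label = panda_array.loc['a':'c']
print(by_label.tolist())      # [2, 3, 4]

# Position-based slice: the stop position 2 is excluded, as in plain Python.
by_position = panda_array.iloc[0:2]
print(by_position.tolist())   # [2, 3]
```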

## .DataFrame():  

***
***


### Quantitative Visualization:

### For multi-dimensional arrays:  

***


`pd.DataFrame()` builds a 2-dimensional labeled table out of several Series, which makes the data much easier to read in Pandas.  
The returned output is an easier to look at table:

In [13]:
dataframe_table = pd.DataFrame({'Column 1': panda_array, 'Column 2': panda_series_1})
dataframe_table

    Column 1  Column 2
A        NaN       4.0
A-       NaN       3.5
B        NaN       2.0
B-       NaN       2.5
a        2.0       NaN
b        3.0       NaN
c        4.0       NaN
d        5.0       NaN


#### Accessing the Data:

In [15]:
dataframe_table.values[0, 1]  # Accessing the 1st row [0] and second column [1]

4.0
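
`.values[0, 1]` goes through the underlying NumPy array, so the row and column labels are ignored. As a hedged sketch, the same cell can also be reached with the pandas indexers `.iloc` (by position) and `.loc` (by label). The frame is rebuilt here so the block runs on its own.

```python
import pandas as pd

panda_array = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])
panda_series_1 = pd.Series({'A': 4, 'A-': 3.5, 'B': 2, 'B-': 2.5})
dataframe_table = pd.DataFrame({'Column 1': panda_array, 'Column 2': panda_series_1})

# Same cell reached by position and by label.
print(dataframe_table.iloc[0, 1])             # first row, second column
print(dataframe_table.loc['A', 'Column 2'])   # row 'A', column 'Column 2'
```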

#### Adding a Column:

In [18]:
dataframe_table['Column 3'] = np.random.rand(8)  # same syntax as adding a key/value pair to a dictionary
dataframe_table

    Column 1  Column 2  Column 3
A        NaN       4.0  0.460539
A-       NaN       3.5  0.385398
B        NaN       2.0  0.759431
B-       NaN       2.5  0.435794
a        2.0       NaN  0.217021
b        3.0       NaN  0.984819
c        4.0       NaN  0.310816
d        5.0       NaN  0.625744


### NaN or Not A Number:  

***


`NaN` appears in place of a missing value. The value may never have been declared or, like in my example, joining both arrays into a DataFrame created row/column combinations that did not exist before. 
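
As an illustration of where the missing values sit, the sketch below rebuilds the two-column frame and counts the NaN entries with `isna()`. The two Series share no labels, so each row ends up with exactly one missing value.

```python
import pandas as pd

panda_array = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])
panda_series_1 = pd.Series({'A': 4, 'A-': 3.5, 'B': 2, 'B-': 2.5})
dataframe_table = pd.DataFrame({'Column 1': panda_array, 'Column 2': panda_series_1})

# isna() marks every missing cell with True; sum() counts them per column.
print(dataframe_table.isna().sum().tolist())         # [4, 4]

# Each row is missing exactly one of its two values.
print(dataframe_table.isna().sum(axis=1).tolist())   # [1, 1, 1, 1, 1, 1, 1, 1]

# Column 1 became float64, because NaN is a floating-point value.
print(dataframe_table['Column 1'].dtype)             # float64
```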

#### Replacing NaN:

There are 2 simple ways to deal with NaN (there are more, but we'll see only these).  

Replace them with a value using the `.fillna()` method,  
OR  
drop them with the very unsurprisingly named `.dropna()` method.

In [32]:
dataframe_table.fillna(0, inplace=True)  # replacing all the NaN with 0
dataframe_table

    Column 1  Column 2  Column 3
A        0.0       4.0  0.460539
A-       0.0       3.5  0.385398
B        0.0       2.0  0.759431
B-       0.0       2.5  0.435794
a        2.0       0.0  0.217021
b        3.0       0.0  0.984819
c        4.0       0.0  0.310816
d        5.0       0.0  0.625744


The `inplace` argument is set to True so that the original DataFrame is affected by the method.  
So no new DataFrame is created. 
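
For comparison, here is a minimal sketch of the same call without `inplace=True`. The small frame and the names `original` and `filled` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with one missing value per column.
original = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 2.0]})

# Without inplace=True, fillna() returns a new DataFrame...
filled = original.fillna(0)

# ...and the original still contains its NaN values.
print(int(original.isna().sum().sum()))   # 2
print(int(filled.isna().sum().sum()))     # 0
```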

In [36]:
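dropped_nan = dataframe_table.dropna()  # dropna() needs its parentheses; it returns a new DataFrame without the rows containing NaN
dropped_nan

    Column 1  Column 2  Column 3
A        0.0       4.0  0.460539
A-       0.0       3.5  0.385398
B        0.0       2.0  0.759431
B-       0.0       2.5  0.435794
a        2.0       0.0  0.217021
b        3.0       0.0  0.984819
c        4.0       0.0  0.310816
d        5.0       0.0  0.625744

Since the NaN values were already replaced by 0 above, there is nothing left to drop and the table is unchanged.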
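
To see `.dropna()` actually remove something, here is a small sketch on a frame that still contains NaN. The frame and its labels (`r1`, `r2`, `r3`, `x`, `y`) are hypothetical; `how`, `subset` and `axis` are parameters of `DataFrame.dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 'r1' is complete, 'r2' has one NaN, 'r3' is all NaN.
frame = pd.DataFrame(
    {'x': [1.0, 2.0, np.nan], 'y': [4.0, np.nan, np.nan]},
    index=['r1', 'r2', 'r3'],
)

# Default: drop every row that contains at least one NaN.
print(frame.dropna().index.tolist())               # ['r1']

# how='all': drop only the rows where every value is NaN.
print(frame.dropna(how='all').index.tolist())      # ['r1', 'r2']

# subset=[...]: only look at the listed columns when deciding.
print(frame.dropna(subset=['x']).index.tolist())   # ['r1', 'r2']

# axis=1: drop columns instead of rows (both columns contain a NaN here).
print(frame.dropna(axis=1).columns.tolist())       # []
```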


## Implicit Vs Explicit Indexes List:  

***
***


### .iloc[] and .loc[]:  

***


We saw earlier that when we declare a Series in Pandas, one way is to declare both the keys and the values. Here's what we wrote:  
```
panda_array = pd.Series([2, 3, 4, 5], index=['a', 'b', 'c', 'd'])
print(panda_array.values)
print(panda_array.index)
```
The first declared list `[2, 3, 4, 5]` holds our values and the 2nd one, `['a', 'b', 'c', 'd']`, holds our keys, or rather our EXPLICIT index list. 

In contrast, the **IMPLICIT** list of indexes is created by default by Pandas: `[0, 1, 2, 3]`. 

When SLICING with plain square brackets and integers (and ONLY when slicing), Pandas uses the IMPLICIT positions by default. The EXPLICIT way is less confusing because it returns the whole declared range, whereas the IMPLICIT way skips the last index's value of that same range.  
This could lead to a lot of confusion and mistakes, especially when both lists are numerical. 

The `.iloc[]` indexer refers to the implicit list.  
The `.loc[]` indexer refers to the explicit list. 

Let's see an example:

In [39]:
explicit_array = pd.Series([1, 2, 3, 4], index=[1, 2, 3, 4])
print(explicit_array)

print(f"\nThis is a slicing referring to an IMPLICIT index list:\n{explicit_array[1:3]}")
print(f"\nThis is a slicing referring to an EXPLICIT index list:\n{explicit_array.loc[1:3]}")

1    1
2    2
3    3
4    4
dtype: int64

This is a slicing referring to an IMPLICIT index list:
2    2
3    3
dtype: int64

This is a slicing referring to an EXPLICIT index list:
1    1
2    2
3    3
dtype: int64


>You notice that I did not use `.iloc[]` to showcase the implicit referencing; this is because slicing with plain brackets and integers already behaves that way by default.  
It is clear that the explicit `.loc[]` indexer is less ambiguous to use. 
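
For completeness, a short sketch that writes both indexers out on the same `explicit_array`, for slices and for single elements:

```python
import pandas as pd

explicit_array = pd.Series([1, 2, 3, 4], index=[1, 2, 3, 4])

# .iloc slices by position: positions 1 and 2, stop position 3 excluded.
print(explicit_array.iloc[1:3].tolist())   # [2, 3]

# .loc slices by label: labels 1, 2 and 3, stop label included.
print(explicit_array.loc[1:3].tolist())    # [1, 2, 3]

# The same distinction applies to single elements.
print(explicit_array.iloc[0])   # 1  (first position)
print(explicit_array.loc[4])    # 4  (label 4)
```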

>Note that `.reshape()` is an array method, so it HAS TO be called on an existing array, such as the one returned by `np.arange()` in the example above.  
Otherwise it would not work.

The syntax for slicing with a step is:  

`array[::x]`  
where x is the jump value. So if x = 2, the slice goes from index 0 to the last index and takes every 2nd entry. 

In [25]:
array_tobe_sliced[::3]  # slicing from Start to End with a jump of 3

array([   0, -999,    6,    9,   12,   15,   18,   21,   24,   27,   30,
         33,   36,   39,   42,   45,   48,   51,   54,   57,   60,   63,
         66,   69,   72,   75,   78,   81,   84,   87,   90,   93,   96,
         99])

So we sliced from the start (0) to the end (99) and jumped 3 indexes every time.

#### from End to start with jumps:

This is almost the same syntax as the previous one:  

`array[::-x]`  
The -x indicates to numpy that it should start at the end and go backward down to the beginning. 

In [26]:
array_tobe_sliced[::-3]  # slicing from the end (99) down to 0 with jumps of 3

array([  99,   96,   93,   90,   87,   84,   81,   78,   75,   72,   69,
         66,   63,   60,   57,   54,   51,   48,   45,   42,   39,   36,
         33,   30,   27,   24,   21,   18,   15,   12,    9,    6, -999,
          0])

So if instead of -3 we use -1 we will get every entry in a reverse order.  
This is a way to reverse an array.
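
The step patterns above can be checked on a smaller array. This is only a sketch: `numbers` is a made-up 10-element array, and the last line uses the full `start:stop:step` form of the same slice syntax.

```python
import numpy as np

numbers = np.arange(10)   # the integers 0 to 9

# Every 3rd entry from the start.
print(numbers[::3].tolist())     # [0, 3, 6, 9]

# Every 3rd entry, starting from the end and going backward.
print(numbers[::-3].tolist())    # [9, 6, 3, 0]

# A step of -1 reverses the array.
print(numbers[::-1].tolist())    # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

# start:stop:step combined: from index 1 up to (not including) 8, every 2nd entry.
print(numbers[1:8:2].tolist())   # [1, 3, 5, 7]
```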

In [27]:
val_index_tofind = -999

# enumerate() yields (index, value) pairs, so a value is never used as an index by mistake
for each_index, each_entry in enumerate(array_tobe_sliced):
    if each_entry == val_index_tofind:
        print(f"the index of the value -999 is {each_index}")
        break

the index of the value -999 is 3

## Extracting information from a Matrix:

***
***


### Accessing rows:  

***


To return a whole row from a matrix the syntax is:  

`array[row_index,:]`  
where `row_index` is the index of the row one wants returned. 

>REMEMBER that the **first index is always 0**, so the first row is indexed at value 0.  

Here is an example:

In [28]:
array_a = np.round(10*np.random.rand(5, 4))  # np.round rounds each value to the nearest whole number (still stored as floats), giving a 5-row, 4-column matrix
array_a

array([[ 4.,  4.,  1.,  3.],
       [ 3.,  0.,  7., 10.],
       [ 4.,  0.,  4.,  8.],
       [ 5.,  1.,  4.,  7.],
       [ 3.,  9.,  5.,  7.]])

In [29]:
array_a[0,:]

array([4., 4., 1., 3.])

So the 0 in `[0,:]` references the 1st row and the ":" tells numpy to return the whole row. 

### Accessing Columns:  

***


Now to access columns one just needs to swap the two positions. So instead of:  
`array_a[0,:]` for the whole 1st row  

we write:  
`array_a[:,0]` to get the whole 1st column

In [30]:
array_a[:,0]

array([4., 3., 4., 5., 3.])

#### Accessing both:

So now that we know that the first argument references the row and the 2nd one references the column,  
we can follow that structure to access any slice of any row or column. 

In [31]:
array_a[1, 2:4]  # returning from the 2nd row, columns 3 & 4

array([ 7., 10.])

In [32]:
array_a[0:2, 1:4]  # returning from rows 1 to 2 the columns 2 to 4

array([[ 4.,  1.,  3.],
       [ 0.,  7., 10.]])
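
Because `array_a` is random, its values change on every run. As a reproducible sketch, the same four access patterns are shown here on a made-up matrix called `matrix` whose entries are simply 0 to 19.

```python
import numpy as np

matrix = np.arange(20).reshape(5, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]
#  [12 13 14 15]
#  [16 17 18 19]]

# Whole first row.
print(matrix[0, :].tolist())       # [0, 1, 2, 3]

# Whole first column.
print(matrix[:, 0].tolist())       # [0, 4, 8, 12, 16]

# 2nd row, columns 3 and 4.
print(matrix[1, 2:4].tolist())     # [6, 7]

# Rows 1 to 2, columns 2 to 4.
print(matrix[0:2, 1:4].tolist())   # [[1, 2, 3], [5, 6, 7]]
```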

### .argwhere() to find the index of a value:  

***


When we want to find the index of a value, looking at the matrix itself is sometimes tedious, especially if it is a big matrix with tens of rows and columns.  

In this case the `np.argwhere()` function returns the indexes of every entry that matches a condition:

Let's try it with the value -999 in the array array_tobe_sliced, which I will give a shorter name to save some time:

In [33]:
atbs = array_tobe_sliced
atbs

array([   0,    1,    2, -999,    4,    5,    6,    7,    8,    9,   10,
         11,   12,   13,   14,   15,   16,   17,   18,   19,   20,   21,
         22,   23,   24,   25,   26,   27,   28,   29,   30,   31,   32,
         33,   34,   35,   36,   37,   38,   39,   40,   41,   42,   43,
         44,   45,   46,   47,   48,   49,   50,   51,   52,   53,   54,
         55,   56,   57,   58,   59,   60,   61,   62,   63,   64,   65,
         66,   67,   68,   69,   70,   71,   72,   73,   74,   75,   76,
         77,   78,   79,   80,   81,   82,   83,   84,   85,   86,   87,
         88,   89,   90,   91,   92,   93,   94,   95,   96,   97,   98,
         99])

Now we can use the method:

In [35]:
index_of_minus999 = np.argwhere(atbs==-999)[0][0]
index_of_minus999

3

The variable now holds the index value of -999, which is 3. 
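
On a real 2-dimensional matrix, `np.argwhere()` returns one `[row, column]` pair per match. The sketch below uses a made-up matrix with the values 0 to 19, and also shows why the `[0][0]` shortcut is only safe when at least one match exists.

```python
import numpy as np

matrix = np.arange(20).reshape(5, 4)

# In 2D, each match comes back as a [row, column] pair.
print(np.argwhere(matrix == 7).tolist())       # [[1, 3]]

# Several matches give several pairs.
print(np.argwhere(matrix % 9 == 0).tolist())   # [[0, 0], [2, 1], [4, 2]]

# No match gives an empty result, so indexing it with [0][0] would raise an IndexError.
print(len(np.argwhere(matrix == -999)))        # 0
```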