# Introduction to Pandas

This Jupyter notebook is optional for those who already feel proficient with the pandas library.

**pandas** arises from the need to have a specific library to analyze data that provides, in the simplest possible way, all the instruments for data processing, data extraction, and data manipulation.

pandas is built on top of NumPy.


Pandas documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/io.html



If you do not have pandas installed, you will need to install it. Enter the following
command:

pip install pandas 

In this notebook we will cover:

1. Series
2. DataFrames
3. GroupBy
4. Merging, Joining and Concatenating
5. Operations
6. Data Input and Output

## Let's get started

In [None]:
import pandas as pd
import numpy as np 

"""
Throughout this notebook, pd refers to the pandas library and np to the NumPy library,
so anything written as pd.<name> or np.<name> is an object or function from one of these two libraries
"""

In [None]:
"""
you can check the version using __version__
"""
pd.__version__

## 1. Series


A Series is the pandas object designed to represent one-dimensional data, similar to an array but with some additional features. Its internal structure is simple: it is composed of two arrays associated with each other. The main array holds the data (of any NumPy type), and each of its elements is associated with a label stored in the other array, called the index.

### Defining Series

In [None]:
"""
To declare a series, call the Series() constructor and pass as an argument an array containing the values to be included in it.
"""
series1 = pd.Series([1,2,3])

In [None]:
series1

In [None]:
"""
you can name the series and give also labels to the index!
it's like a numpy array with names! 

1. create a list with the index labels
2. a list with the values
3. and pass them as parameters to the Series constructor!
"""
labels = ["a","b","c"] 
mydata = [1,2,3]
series2 = pd.Series(data=mydata, index=labels, name="my_series") 

In [None]:
series2

In [None]:
"""
you can also define a new series from a dictionary
Note: A dictionary is a changeable collection of key-value pairs (since Python 3.7 it preserves insertion order). You can think of it as a map from keys to values
"""
dict_ = {"a": 10, "b":20, "c":30} # "a", "b" and "c" are the keys and the numbers are the values.



In [None]:
dict_

In [None]:
"""
create the series in the same way, this time pass as a parameter the dictionary that we created
"""
series3 = pd.Series(dict_)

In [None]:
"""
As you can see from this example, the array of the index is filled with the keys while
the data are filled with the corresponding values
"""
series3

In [None]:
"""
If you want to individually see the two arrays that make up this data structure, you
can call the two attributes of the series as follows: 

index and values
"""

print("series values: ", series3.values)
print("series index: ", series3.index)

### Selecting elements from series

In [None]:
series2

In [None]:
"""
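Using the position:

You can select individual elements as in an ordinary numpy array, by specifying the position.
Start counting from 0 for the 1st element.
Use .iloc for positions: with a labelled index, plain series2[1] is deprecated in recent pandas versions
"""
series2.iloc[1]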

In [None]:
"""
Or by label:

by specifying the label that the element has in the index.
"""

series2["b"]

In [None]:
"""
For multiple elements:
In the same way that you select multiple items in a numpy array, you can pass a slice of positions
"""
series2[0:2]

In [None]:
#or:
series2[["a", "b"]]
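The selections above can also be written with the explicit `.iloc` (position) and `.loc` (label) accessors. The following is a minimal sketch on an illustrative series named `s`, which is not part of the notebook; note that a label slice includes its end label while a position slice does not.

```python
import pandas as pd

# Illustrative series with the same shape as series2 above
s = pd.Series([1, 2, 3], index=["a", "b", "c"], name="my_series")

by_position = s.iloc[1]        # second element, counting from 0
by_label = s.loc["b"]          # element whose index label is "b"

pos_slice = s.iloc[0:2]        # positions 0 and 1 (end position excluded)
label_slice = s.loc["a":"b"]   # labels "a" through "b" (end label included)

print(by_position, by_label)
print(pos_slice.tolist(), label_slice.tolist())
```

Being explicit about `.loc` or `.iloc` avoids any ambiguity when the index labels are themselves integers.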

## 2. DataFrames

Unlike a series, which has a single index array containing the labels associated with each
element, a dataframe has two index arrays. The first index array, associated with
the rows, works very much like the index array of a series: each label is
associated with all the values in its row. The second array contains the labels
associated with each column.
A dataframe may also be understood as a dict of series, where the keys are the
column names and the values are the series that form the columns of the dataframe. The elements of every series are aligned on a shared array of row labels, called the index.
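To make the "dict of series" view concrete, here is a small sketch that builds a dataframe from two series sharing the same index. The names `colors`, `prices` and `small_df` are illustrative and do not appear elsewhere in the notebook.

```python
import pandas as pd

# Two series that share the same row labels
colors = pd.Series(["blue", "green"], index=["one", "two"])
prices = pd.Series([1.2, 1.0], index=["one", "two"])

# The dict keys become the column labels, the shared index becomes the row labels
small_df = pd.DataFrame({"color": colors, "price": prices})

print(small_df.index.tolist())    # row labels
print(small_df.columns.tolist())  # column labels
```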

### Defining dataframes

In [None]:
"""
We will create a dictionary with many keys, each key will have a list of elements

Important note: A dictionary has unique keys. You can't have two identical keys
"""

data = {'color' : ['blue','green','yellow','red','white'],
 'object' : ['ball','pen','pencil','paper','mug'],
 'price' : [1.2,1.0,0.6,0.9,1.7]}

In [None]:
data

In [None]:
"""
Convert it into a dataframe!
"""
df = pd.DataFrame(data)

In [None]:
"""
Notice that by default the keys of the dict become the DataFrame columns
"""
df

In [None]:
#you can make a selection of columns
frame2 = pd.DataFrame(data, columns=['object','price'])

In [None]:
frame2

In [None]:
"""
You can also assign labels to the index! You can do this by specifying the index parameter of the DataFrame constructor
"""
df = pd.DataFrame(data, index = ["one", "two", "three", "four", "five"])

In [None]:
df

### Slicing and locating

In [None]:
"""
Pass the name of the column, and you'll get the column you want
"""
df["color"]

In [None]:
"""
The result is a series object

Check the type by the type()
"""

type(df["color"])

In [None]:
"""
Another way to grab a column is attribute access: df.<column name>.

It returns the same result as the bracket notation. However, if the column name clashes with an existing
DataFrame attribute or method (or is not a valid Python identifier), the attribute form will not return the column.
So use the bracket notation to be on the safe side.

Note: if you type df. and press Tab you'll see all the attributes and methods you can call on the object
"""

df.color

In [None]:
"""
Selection of multiple columns
"""

df[["object", "price"]]

In [None]:
"""
.loc

label-based selection
"""
df.loc["one"]

In [None]:
"""
.iloc

integer position-based selection
"""
df.iloc[0]

In [None]:
""" 
selecting a single value by row label and column label, using loc
"""
df.loc["one","price"]

In [None]:
"""
The same but using iloc

In this case you need to specify the integer positions of the row and the column
"""
df.iloc[0,2]

In [None]:
"""
Selecting a subset of the dataframe 

format df.loc[[rows], [columns]]
"""
df.loc[["one","two"], ["color", "price"]]

In [None]:
"""
The same result using iloc
"""
df.iloc[[0,1], [0,2]]

In [None]:
df

In [None]:
"""
Similarly you can define a range and get the subset of a dataframe
"""
df.iloc[1:3,1:3]
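One detail worth checking when you slice with ranges: `.iloc` excludes the end position, while `.loc` includes the end label. A minimal sketch on an illustrative dataframe named `demo` (not part of the notebook):

```python
import pandas as pd

demo = pd.DataFrame(
    {"color": ["blue", "green", "yellow"], "price": [1.2, 1.0, 0.6]},
    index=["one", "two", "three"],
)

by_pos = demo.iloc[0:2]            # rows at positions 0 and 1
by_label = demo.loc["one":"two"]   # rows "one" and "two" (end label included)

print(by_pos)
print(by_label)
```

Both selections return the same two rows here, even though the position slice stops before 2 and the label slice stops at "two".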

### Assigning values

In [None]:
"""
Creating a new column in the dataframe called new. 

Just assign a new column with the desired name and give it the values that you want. 

Here is an example where the new column is the combination of color and object columns
"""
df["new"] = df["color"] + df["object"]

In [None]:
df

In [None]:
"""
Here the new column, called new2, is the price column raised to the power of 2. The operation is applied element-wise, to every row
"""

df["new2"] = df["price"] **2

In [None]:
df

### Drop 

In [None]:
"""

Let's now remove the columns that we created above. 


You can use .drop() to do that. As parameters pass the name or list of names that you want to delete and specify the axis. 

Note1: 
If axis=0, it will delete the rows with the given names
If axis=1, it will delete the columns with the given names

"""

df.drop(["new", "new2"], axis=1)
print(df)


In [None]:
"""
Note2:

By default, drop returns a modified copy and leaves the original dataframe untouched. Pass inplace=True (or assign the result back to df) if you want the change to be saved

"""
df.drop(["new", "new2"], axis=1, inplace=True)

In [None]:
"""

Let's now drop the row with index label equal to three and just display the result


"""
df.drop("three", axis=0)
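The sketch below illustrates the copy-versus-in-place behaviour of `drop` on a small illustrative dataframe named `demo`. It also uses the `index=` keyword, which is an alternative to passing `axis=0`.

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

without_b = demo.drop("b", axis=1)   # returns a new dataframe without column "b"
without_x = demo.drop(index="x")     # keyword form, same as demo.drop("x", axis=0)

print(demo.columns.tolist())       # the original still has both columns
print(without_b.columns.tolist())
print(without_x.index.tolist())
```

Assigning the result to a new name, as done here, is an alternative to `inplace=True` that keeps the original dataframe available.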

### Assign a series to a dataframe

In [None]:
ser = pd.Series(np.arange(5), index=["one", "two", "three", "four", "five"]) # np.arange(5) returns the evenly spaced values 0, 1, 2, 3, 4


# Note that the index of the series and the index of the dataframe should match

# assign it to the dataframe
df['new'] = ser

In [None]:
ser

In [None]:
df
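To see why the indexes should match, here is a small sketch in which the series covers only some of the row labels. The names `demo` and `partial` are illustrative. pandas aligns on the labels and fills the rows that have no match with NaN.

```python
import pandas as pd

demo = pd.DataFrame({"price": [1.2, 1.0, 0.6]}, index=["one", "two", "three"])

# This series has no value for the label "two"
partial = pd.Series([10, 30], index=["one", "three"])

demo["new"] = partial   # values are aligned on the index labels, not on position
print(demo)
```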

### Membership Values

In [None]:


"""

You get a dataframe containing Boolean values, where True indicates values that
meet the membership.

"""


df.isin([1.0,'pen'])




In [None]:
"""


If you use the returned Boolean dataframe as a mask, you’ll get a
new dataframe of the same shape in which only the values that satisfy the condition are kept; all other values become NaN.

"""


df[df.isin([1.0,'pen'])]
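A common variation, sketched here on an illustrative dataframe named `demo`, is to call `isin` on a single column. The result is then a Boolean series that selects whole rows instead of masking individual values.

```python
import pandas as pd

demo = pd.DataFrame(
    {"object": ["ball", "pen", "pencil"], "price": [1.2, 1.0, 0.6]}
)

mask = demo["object"].isin(["pen", "pencil"])   # one True/False per row
selected = demo[mask]                           # keeps only the rows where mask is True

print(mask.tolist())
print(selected)
```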

In [None]:
#another way to delete a column

del df["new"]

In [None]:
df

### Conditional selection

In [None]:

"""
Conditional selection using brackets notations:
You can apply the filtering through the application of certain conditions.


Let's check: which rows have a price less than 1.2?
Direct masking operations are interpreted row-wise and not column-wise

"""

df[df["price"]<1.2]


In [None]:
#multiple conditions, use parentheses!

df[(df["price"]<1.2) & (df["object"]=="pencil")]
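Besides `&` (and), conditions can be combined with `|` (or) and negated with `~` (not). The sketch below uses an illustrative dataframe named `demo` with the same objects and prices as above.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "object": ["ball", "pen", "pencil", "paper", "mug"],
        "price": [1.2, 1.0, 0.6, 0.9, 1.7],
    }
)

# rows that are cheaper than 1.0 OR are the mug
cheap_or_mug = demo[(demo["price"] < 1.0) | (demo["object"] == "mug")]

# rows that are NOT cheaper than 1.0
not_cheap = demo[~(demo["price"] < 1.0)]

print(cheap_or_mug["object"].tolist())
print(not_cheap["object"].tolist())
```

As with `&`, each condition needs its own parentheses, because `|` and `~` bind more tightly than the comparison operators.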

### Transposition

In [None]:
"""
An operation that you might need when you’re dealing with tabular data structures is
transposition (that is, columns become rows and rows become columns). pandas allows
you to do this in a very simple way. You can get the transpose of the dataframe by
accessing its T attribute.

"""
df.T

### Sample

In [None]:
"""
Return a random sample of items from an axis of the object (rows by default)

"""


df.sample(n=1)

In [None]:
df.sample(frac=0.5)

### Create intervals

In [None]:
df

In [None]:
"""
Bin values into discrete intervals

"""



df["price_bins"] = pd.cut(x = df["price"], bins=[0, 0.6, 1.0, 1.7], labels=["interval1", "interval2", "interval3"])



In [None]:
df
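To check how the bin edges are interpreted, here is a small sketch using the same prices and bins on an illustrative series. By default each interval is open on the left and closed on the right, so a price of 1.0 falls in the bin (0.6, 1.0], and a value outside all bins becomes NaN.

```python
import pandas as pd

prices = pd.Series([1.2, 1.0, 0.6, 0.9, 1.7])

binned = pd.cut(
    prices,
    bins=[0, 0.6, 1.0, 1.7],
    labels=["interval1", "interval2", "interval3"],
)
print(binned.tolist())

# 2.0 lies outside the last bin edge (1.7), so it is not assigned to any interval
outside = pd.cut(pd.Series([2.0]), bins=[0, 0.6, 1.0, 1.7])
print(outside.isna().tolist())
```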

### Set and reset index

In [None]:
"""
The index will become the first column of the dataframe
"""
df.reset_index(inplace=True)



In [None]:
df

In [None]:
"""
You can now replace the index with one of your own.
Assign a new column with the desired index labels,
then set that column as the new index using .set_index()

"""


index = ["a","b","c","d","e"]
df["new_index"] = index
df.set_index("new_index", inplace=True)

In [None]:
df

### Multi-index


In [None]:
#Multi-index
outside = ["G1", "G1", "G1", "G2", "G2", "G2"]
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside)) #list of tuple pairs
hier_index = pd.MultiIndex.from_tuples(hier_index) # builds a MultiIndex from the list of tuples

In [None]:
from numpy.random import randn
df = pd.DataFrame(randn(6,2), hier_index, ["A", "B"])
#constructing a multi-level index

In [None]:
df

In [None]:
#call data
df.loc["G1"]
df.loc["G1"].loc[1]

In [None]:
df.index.names
# the index levels don't have names yet,
# so you can set them:
df.index.names = ["Groups", "Num"]

In [None]:
df.loc["G2"].loc[2]["B"]

In [None]:
# cross-section of a multi-level index
# xs has the ability to go inside a multi-level index
df.xs("G1")
# suppose you want all the values where the inner index is 1
df.xs(1,level="Num") # it would be more complicated with loc
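As a complement to chaining `.loc` calls, a row of a multi-level index can be addressed in one step by passing a tuple of labels to `.loc`. The sketch below uses a small illustrative dataframe named `demo` with fixed values (instead of random ones) so that the results are easy to follow.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("G1", 1), ("G1", 2), ("G2", 1), ("G2", 2)],
    names=["Groups", "Num"],
)
demo = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

row = demo.loc[("G2", 1)]          # one row, selected by (outer, inner) labels
value = demo.loc[("G2", 1), "B"]   # a single cell in that row
inner = demo.xs(1, level="Num")    # every row whose inner label is 1

print(row.tolist(), value)
print(inner)
```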

### GroupBy


In [None]:
"""

Groupby allows you to group together rows based on a column and perform an aggregate function on them


"""

# Create a new dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}


In [None]:
data = pd.DataFrame(data)

In [None]:
data

In [None]:
"""

#Use groupby

Using only groupby returns a DataFrameGroupBy object; displaying it just shows where it's stored in memory.



"""
data.groupby('Company')

In [None]:
"""
An essential piece of analysis of large data is efficient summarization: computing
aggregations like sum(), mean(), median(), min(), and max(), in which a single number
gives insight into the nature of a potentially large dataset


So after groupby, specify a summarization function!

"""
byComp = data.groupby('Company').mean(numeric_only=True)  # skip the non-numeric Person column (required in pandas 2.0+)

In [None]:
byComp
#returns the mean only for Sales because Person holds strings

In [None]:
byComp = data.groupby('Company').sum().loc["FB"]

In [None]:
byComp

In [None]:
byComp = data.groupby('Company').describe()

In [None]:
byComp
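If you only need a few statistics rather than the full output of `describe()`, one option is to select the column and pass a list of function names to `agg`. This is a minimal sketch on an illustrative dataframe named `sales` that reuses the company and sales figures from above.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "Company": ["GOOG", "GOOG", "MSFT", "MSFT", "FB", "FB"],
        "Sales": [200, 120, 340, 124, 243, 350],
    }
)

# One row per company, one column per aggregation
summary = sales.groupby("Company")["Sales"].agg(["sum", "mean", "max"])
print(summary)
```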

### Merging, Joining, Concatenating

All of the preceding operations worked on a single DataFrame. It’s also possible to combine
multiple DataFrames into one.
We’ll take a look at those operations here.

In [None]:
#Let's first create three dataframes



df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])


In [None]:
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 

In [None]:
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

In [None]:
df1

### Concatenation

Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use **pd.concat** and pass in a list of DataFrames to concatenate together:

In [None]:
pd.concat([df1,df2,df3])

In [None]:
"""
#concat uses axis=0 by default and stacks the dataframes on top of each other, row by row.


Let's now try axis=1.
You can see missing values because the row labels of the three dataframes do not match!



Important!
Make sure that the information lines up correctly along the axis you concatenate on

"""

pd.concat([df1,df2,df3],axis=1)
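Two variations that are often useful are sketched below on illustrative dataframes named `top` and `bottom`: `ignore_index=True` renumbers the rows when stacking, and concatenating along `axis=1` produces no missing values when the row labels line up.

```python
import pandas as pd

top = pd.DataFrame({"A": ["A0", "A1"]}, index=[0, 1])
bottom = pd.DataFrame({"A": ["A2", "A3"]}, index=[0, 1])

# Stack the rows and build a fresh 0..n-1 index instead of repeating 0, 1
stacked = pd.concat([top, bottom], ignore_index=True)

# Place the frames side by side; both share the row labels 0 and 1
side_by_side = pd.concat([top, bottom.rename(columns={"A": "B"})], axis=1)

print(stacked)
print(side_by_side)
```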

### Merge

In [None]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})    



The **merge** function allows you to merge DataFrames together using a similar logic as merging SQL Tables together. For example:

In [None]:
pd.merge(left,right,how='inner',on='key')
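In the example above every key appears in both dataframes, so the choice of `how` makes no difference. The sketch below uses illustrative dataframes (`left_demo`, `right_demo`) whose keys only partly overlap, to show how `inner`, `left` and `outer` differ.

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right_demo = pd.DataFrame({"key": ["K0", "K2", "K3"], "C": ["C0", "C2", "C3"]})

inner = pd.merge(left_demo, right_demo, how="inner", on="key")    # keys in both
left_only = pd.merge(left_demo, right_demo, how="left", on="key") # all keys of left_demo
outer = pd.merge(left_demo, right_demo, how="outer", on="key")    # keys in either

print(inner)
print(left_only)
print(outer)
```

Rows without a partner in the other dataframe get NaN in the columns that come from that dataframe.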


### Joining


In [None]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

In [None]:
#like merge, but keys are on index instead of a column
left.join(right)
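By default `join` keeps every row of the calling dataframe (a left join). A short sketch with illustrative dataframes shows the default next to `how="inner"`, which keeps only the labels present in both indexes.

```python
import pandas as pd

left_demo = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right_demo = pd.DataFrame({"C": ["C0", "C2", "C3"]}, index=["K0", "K2", "K3"])

default_join = left_demo.join(right_demo)             # left join on the index
inner_join = left_demo.join(right_demo, how="inner")  # only labels found in both

print(default_join)
print(inner_join)
```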

### Functions by Row and Column

You can also apply your own functions to a dataframe.
The important point is that they operate on a one-dimensional
array and give a single number as a result. For example, you can define a lambda function
that calculates the range covered by the elements in an array (the maximum minus the minimum).


In [None]:
#lambda is an anonymous function
f = lambda x: x.max() - x.min()

In [None]:
# The lambda function above is equivalent to this function
def f(x):
    return x.max() - x.min()

In [None]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
    index=['red','blue','yellow','white'],
    columns=['ball','pen','pencil','paper'])

In [None]:
frame

In [None]:
#Using the apply() function, you can apply the f function just defined on the dataframe column wise
frame.apply(f)

In [None]:
# and of course row wise, by specifying axis=1
frame.apply(f, axis=1)
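As a self-contained check of what the two calls compute, the sketch below rebuilds the same 4x4 dataframe under the illustrative name `frame_demo` and applies a range function called `value_range` (also an illustrative name). Each column spans from its first to its last row, and each row spans four consecutive numbers.

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["red", "blue", "yellow", "white"],
    columns=["ball", "pen", "pencil", "paper"],
)

def value_range(x):
    # x is one column (axis=0) or one row (axis=1), passed as a Series
    return x.max() - x.min()

per_column = frame_demo.apply(value_range)          # one value per column
per_row = frame_demo.apply(value_range, axis=1)     # one value per row

print(per_column.tolist())
print(per_row.tolist())
```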

### Input and Output

In [None]:
### Let's now store this dataframe into a csv file
frame

In [None]:
"""
you can do that by calling the to_csv method and passing as a parameter the path where you want the new file to be stored.
If you don't specify a path and just type the name of the file, it will be stored in your current working directory
"""

frame.to_csv("frame.csv")

In [None]:
"""
Now, if you want to read a csv file you can use read_csv.
index_col=0 tells pandas to use the first column of the file as the index

"""

data = pd.read_csv("frame.csv", index_col=0 )

In [None]:
data
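If you want to try the write and read round trip without creating a file on disk, one option is an in-memory text buffer from the standard library. This is a sketch under that assumption; `frame_demo`, `buffer` and `restored` are illustrative names.

```python
import io

import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(
    np.arange(4).reshape((2, 2)),
    index=["red", "blue"],
    columns=["ball", "pen"],
)

buffer = io.StringIO()      # behaves like a text file held in memory
frame_demo.to_csv(buffer)   # write the CSV text into the buffer
buffer.seek(0)              # rewind to the start before reading

restored = pd.read_csv(buffer, index_col=0)   # first column becomes the index again
print(restored)
```

Without `index_col=0`, the row labels would come back as an ordinary first column and pandas would add a new default integer index.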


# End of notebook for pandas
