![Ironhack logo](https://i.imgur.com/1QgrNNw.png)

# Lab | Introduction to Pandas

## Introduction

In the Introduction to Pandas lesson, we learned about the two main data structures in Pandas (Series and DataFrames), how to work with them, how to obtain them from other data structures, and how to perform basic calculations with them.

The goal of this lab is to help you practice the concepts you learned in the lesson and provide you with some hands-on experience working with Pandas.

## Getting Started

Read the instructions for each cell and provide your answers. Test your answer in each cell and save your work. Jupyter Notebook should save your progress automatically, but it's a good idea to save manually from time to time just in case.

## Resources

- [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [Intro to Pandas Data Structures](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)
- [Descriptive Statistics for Pandas DataFrame](https://chrisalbon.com/python/data_wrangling/pandas_dataframe_descriptive_stats/)

# Introduction to Pandas Lab

Complete the following set of exercises to solidify your knowledge of Pandas fundamentals.

### 1. Import Numpy and Pandas and alias them to `np` and `pd` respectively.

In [1]:
import numpy as np
import pandas as pd

### 2. Create a Pandas Series containing the elements of the list below.

Expected output:

````python
            0     5.7
            1    75.2
            2    74.4
            3    84.0
            4    66.5
            5    66.3
            6    55.8
            7    75.7
            8    29.1
            9    43.7
            dtype: float64
    
````

In [2]:
lst = [5.7, 75.2, 74.4, 84.0, 66.5, 66.3, 55.8, 75.7, 29.1, 43.7]

In [3]:
# assign the new Series to a variable lst_s
lst_s = pd.Series(lst)

In [4]:
# print the values
print(lst_s)

0     5.7
1    75.2
2    74.4
3    84.0
4    66.5
5    66.3
6    55.8
7    75.7
8    29.1
9    43.7
dtype: float64


In [5]:
# checking the variable type
type(lst_s)

pandas.core.series.Series
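
As a small aside, a Series does not have to use the default 0 to 9 index. The sketch below builds the same Series with explicit labels; the `obs_*` labels and the name `values` are hypothetical and were invented only for this illustration.

```python
import pandas as pd

lst = [5.7, 75.2, 74.4, 84.0, 66.5, 66.3, 55.8, 75.7, 29.1, 43.7]

# Hypothetical labels, made up for this example only
labels = [f"obs_{i}" for i in range(len(lst))]

# Same values as lst_s, but indexed by the labels above
labeled_s = pd.Series(lst, index=labels, name="values")

print(labeled_s.head(3))
print(labeled_s.dtype)  # float64
print(len(labeled_s))   # 10
```

With labels in place, `labeled_s["obs_2"]` returns the same value as the third element of the original list.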

### 3. Use indexing to return the third value in the Series above.

*Hint: Remember that indexing begins at 0.*

In [6]:
# the first index is 0, followed by 1, and then 2
lst_s[2]

74.4
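
With the default index, `lst_s[2]` works because label 2 and position 2 coincide. A minimal sketch of the difference between position-based (`iloc`) and label-based (`loc`) access, using a reversed copy of the Series as an illustrative case:

```python
import pandas as pd

lst_s = pd.Series([5.7, 75.2, 74.4, 84.0, 66.5, 66.3, 55.8, 75.7, 29.1, 43.7])

third_by_position = lst_s.iloc[2]  # position 2, regardless of labels
third_by_label = lst_s.loc[2]      # label 2; identical here because the index is 0..9

# After reversing, the labels travel with the values but the positions do not
reversed_s = lst_s.iloc[::-1]

print(third_by_position, third_by_label)  # 74.4 74.4
print(reversed_s.iloc[2])                 # 75.7 (third element of the reversed Series)
print(reversed_s.loc[2])                  # 74.4 (still the value labeled 2)
```

Using `iloc` makes the intent "third value" explicit even if the index is later changed.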

### 4. Create a Pandas DataFrame from the list of lists below. Each sublist should be represented as a row.

Expected output:

|    |    0 |    1 |    2 |    3 |    4 |
|---:|-----:|-----:|-----:|-----:|-----:|
|  0 | 53.1 | 95   | 67.5 | 35   | 78.4 |
|  1 | 61.3 | 40.8 | 30.8 | 37.8 | 87.6 |
|  2 | 20.6 | 73.2 | 44.2 | 14.6 | 91.8 |
|  3 | 57.4 |  0.1 | 96.1 |  4.2 | 69.5 |
|  4 | 83.6 | 20.5 | 85.4 | 22.8 | 35.9 |
|  5 | 49   | 69   |  0.1 | 31.8 | 89.1 |
|  6 | 23.3 | 40.7 | 95   | 83.8 | 26.9 |
|  7 | 27.6 | 26.4 | 53.8 | 88.8 | 68.5 |
|  8 | 96.6 | 96.4 | 53.4 | 72.4 | 50.1 |
|  9 | 73.7 | 39   | 43.2 | 81.6 | 34.7 |

In [7]:
b = [[53.1, 95.0, 67.5, 35.0, 78.4],
     [61.3, 40.8, 30.8, 37.8, 87.6],
     [20.6, 73.2, 44.2, 14.6, 91.8],
     [57.4, 0.1, 96.1, 4.2, 69.5],
     [83.6, 20.5, 85.4, 22.8, 35.9],
     [49.0, 69.0, 0.1, 31.8, 89.1],
     [23.3, 40.7, 95.0, 83.8, 26.9],
     [27.6, 26.4, 53.8, 88.8, 68.5],
     [96.6, 96.4, 53.4, 72.4, 50.1],
     [73.7, 39.0, 43.2, 81.6, 34.7]]

In [8]:
# apply the pd.DataFrame() function to the list of lists 'b'
df = pd.DataFrame(b)

In [9]:
df

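|    |    0 |    1 |    2 |    3 |    4 |
|---:|-----:|-----:|-----:|-----:|-----:|
|  0 | 53.1 | 95.0 | 67.5 | 35.0 | 78.4 |
|  1 | 61.3 | 40.8 | 30.8 | 37.8 | 87.6 |
|  2 | 20.6 | 73.2 | 44.2 | 14.6 | 91.8 |
|  3 | 57.4 |  0.1 | 96.1 |  4.2 | 69.5 |
|  4 | 83.6 | 20.5 | 85.4 | 22.8 | 35.9 |
|  5 | 49.0 | 69.0 |  0.1 | 31.8 | 89.1 |
|  6 | 23.3 | 40.7 | 95.0 | 83.8 | 26.9 |
|  7 | 27.6 | 26.4 | 53.8 | 88.8 | 68.5 |
|  8 | 96.6 | 96.4 | 53.4 | 72.4 | 50.1 |
|  9 | 73.7 | 39.0 | 43.2 | 81.6 | 34.7 |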


In [10]:
# check the df type
type(df)

pandas.core.frame.DataFrame

In [11]:
# access a single value by position: row 3, column 0
df.iloc[3, 0]

57.4
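
The cells above can be condensed into a short sketch that also checks the shape of the result. It uses only the first four sublists of `b` to stay brief; `df_small`, `row`, and `col` are names chosen for this example.

```python
import pandas as pd

# First four sublists of b only, to keep the example short
b_small = [[53.1, 95.0, 67.5, 35.0, 78.4],
           [61.3, 40.8, 30.8, 37.8, 87.6],
           [20.6, 73.2, 44.2, 14.6, 91.8],
           [57.4, 0.1, 96.1, 4.2, 69.5]]

df_small = pd.DataFrame(b_small)  # each sublist becomes one row

print(df_small.shape)       # (4, 5): 4 rows, 5 columns
print(df_small.iloc[3, 0])  # 57.4

row = df_small.iloc[3]     # a whole row, returned as a Series
col = df_small.iloc[:, 0]  # a whole column, returned as a Series
```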

### 5. Rename the data frame columns based on the names in the list below.

Expected output:

|    |   Score_1 |   Score_2 |   Score_3 |   Score_4 |   Score_5 |
|---:|----------:|----------:|----------:|----------:|----------:|
|  0 |      53.1 |      95   |      67.5 |      35   |      78.4 |
|  1 |      61.3 |      40.8 |      30.8 |      37.8 |      87.6 |
|  2 |      20.6 |      73.2 |      44.2 |      14.6 |      91.8 |
|  3 |      57.4 |       0.1 |      96.1 |       4.2 |      69.5 |
|  4 |      83.6 |      20.5 |      85.4 |      22.8 |      35.9 |
|  5 |      49   |      69   |       0.1 |      31.8 |      89.1 |
|  6 |      23.3 |      40.7 |      95   |      83.8 |      26.9 |
|  7 |      27.6 |      26.4 |      53.8 |      88.8 |      68.5 |
|  8 |      96.6 |      96.4 |      53.4 |      72.4 |      50.1 |
|  9 |      73.7 |      39   |      43.2 |      81.6 |      34.7 |

In [12]:
colnames = ['Score_1', 'Score_2', 'Score_3', 'Score_4', 'Score_5']

In [13]:
# check the df's columns
df.columns

RangeIndex(start=0, stop=5, step=1)

In [14]:
# assign the list colnames to df.columns
df.columns = colnames

In [15]:
# check the columns again
df.columns

Index(['Score_1', 'Score_2', 'Score_3', 'Score_4', 'Score_5'], dtype='object')

In [16]:
# display the renamed df
df

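|    |   Score_1 |   Score_2 |   Score_3 |   Score_4 |   Score_5 |
|---:|----------:|----------:|----------:|----------:|----------:|
|  0 |      53.1 |      95.0 |      67.5 |      35.0 |      78.4 |
|  1 |      61.3 |      40.8 |      30.8 |      37.8 |      87.6 |
|  2 |      20.6 |      73.2 |      44.2 |      14.6 |      91.8 |
|  3 |      57.4 |       0.1 |      96.1 |       4.2 |      69.5 |
|  4 |      83.6 |      20.5 |      85.4 |      22.8 |      35.9 |
|  5 |      49.0 |      69.0 |       0.1 |      31.8 |      89.1 |
|  6 |      23.3 |      40.7 |      95.0 |      83.8 |      26.9 |
|  7 |      27.6 |      26.4 |      53.8 |      88.8 |      68.5 |
|  8 |      96.6 |      96.4 |      53.4 |      72.4 |      50.1 |
|  9 |      73.7 |      39.0 |      43.2 |      81.6 |      34.7 |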
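
Assigning to `df.columns` changes the data frame in place. As a possible alternative, the sketch below uses `DataFrame.rename` with a mapping, which returns a new data frame and leaves the original untouched. `df_demo` and `mapping` are illustrative names, and the data is a two-row excerpt.

```python
import pandas as pd

# Two-row, three-column excerpt, used only for illustration
df_demo = pd.DataFrame([[53.1, 95.0, 67.5],
                        [61.3, 40.8, 30.8]])

# Map the default integer column labels 0, 1, 2 to Score_1, Score_2, Score_3
mapping = {i: f"Score_{i + 1}" for i in df_demo.columns}

renamed = df_demo.rename(columns=mapping)

print(list(renamed.columns))  # ['Score_1', 'Score_2', 'Score_3']
print(list(df_demo.columns))  # [0, 1, 2] (the original is unchanged)
```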


### 6. Create a subset of this data frame that contains only the Score_1, Score_3, and Score_5 columns.

### Subset the data based on the column names

Expected output:

|    |   Score_1 |   Score_3 |   Score_5 |
|---:|----------:|----------:|----------:|
|  0 |      53.1 |      67.5 |      78.4 |
|  1 |      61.3 |      30.8 |      87.6 |
|  2 |      20.6 |      44.2 |      91.8 |
|  3 |      57.4 |      96.1 |      69.5 |
|  4 |      83.6 |      85.4 |      35.9 |
|  5 |      49   |       0.1 |      89.1 |
|  6 |      23.3 |      95   |      26.9 |
|  7 |      27.6 |      53.8 |      68.5 |
|  8 |      96.6 |      53.4 |      50.1 |
|  9 |      73.7 |      43.2 |      34.7 |

In [17]:
# create a list of variables that we want to consider
subset = ['Score_1', 'Score_3', 'Score_5']

In [18]:
# slice the df with brackets
df_subset = df[subset]

In [19]:
df_subset.head()

|    |   Score_1 |   Score_3 |   Score_5 |
|---:|----------:|----------:|----------:|
|  0 |      53.1 |      67.5 |      78.4 |
|  1 |      61.3 |      30.8 |      87.6 |
|  2 |      20.6 |      44.2 |      91.8 |
|  3 |      57.4 |      96.1 |      69.5 |
|  4 |      83.6 |      85.4 |      35.9 |


In [20]:
# Alternative: select the columns by position with iloc

df_subset = df.iloc[:, [0,2,4]]
df_subset

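|    |   Score_1 |   Score_3 |   Score_5 |
|---:|----------:|----------:|----------:|
|  0 |      53.1 |      67.5 |      78.4 |
|  1 |      61.3 |      30.8 |      87.6 |
|  2 |      20.6 |      44.2 |      91.8 |
|  3 |      57.4 |      96.1 |      69.5 |
|  4 |      83.6 |      85.4 |      35.9 |
|  5 |      49.0 |       0.1 |      89.1 |
|  6 |      23.3 |      95.0 |      26.9 |
|  7 |      27.6 |      53.8 |      68.5 |
|  8 |      96.6 |      53.4 |      50.1 |
|  9 |      73.7 |      43.2 |      34.7 |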


In [21]:
# Alternative: select the columns by label with loc

df_subset = df.loc[ : , ['Score_1', 'Score_3', 'Score_5'] ]
df_subset

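|    |   Score_1 |   Score_3 |   Score_5 |
|---:|----------:|----------:|----------:|
|  0 |      53.1 |      67.5 |      78.4 |
|  1 |      61.3 |      30.8 |      87.6 |
|  2 |      20.6 |      44.2 |      91.8 |
|  3 |      57.4 |      96.1 |      69.5 |
|  4 |      83.6 |      85.4 |      35.9 |
|  5 |      49.0 |       0.1 |      89.1 |
|  6 |      23.3 |      95.0 |      26.9 |
|  7 |      27.6 |      53.8 |      68.5 |
|  8 |      96.6 |      53.4 |      50.1 |
|  9 |      73.7 |      43.2 |      34.7 |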
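
The three approaches in this exercise (brackets, `iloc`, and `loc`) should produce the same subset. A minimal sketch that checks this with `DataFrame.equals`, using a two-row excerpt named `df_demo` for brevity:

```python
import pandas as pd

df_demo = pd.DataFrame(
    [[53.1, 95.0, 67.5, 35.0, 78.4],
     [61.3, 40.8, 30.8, 37.8, 87.6]],
    columns=['Score_1', 'Score_2', 'Score_3', 'Score_4', 'Score_5'],
)

wanted = ['Score_1', 'Score_3', 'Score_5']

by_brackets = df_demo[wanted]             # list of column names in brackets
by_position = df_demo.iloc[:, [0, 2, 4]]  # all rows, columns at positions 0, 2, 4
by_label = df_demo.loc[:, wanted]         # all rows, columns selected by label

print(by_brackets.equals(by_position))  # True
print(by_brackets.equals(by_label))     # True
```

Selecting by name is usually more robust than selecting by position, since it keeps working if the column order changes.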


### 7. From the original data frame, calculate the average Score_3 value.


Expected output:

````python
            56.95
````

In [22]:
# select the 'Score_3' column and then call the function mean
df.Score_3.mean()

56.95000000000001

In [23]:
# we can also select the 'Score_3' column using brackets
# brackets are generally preferred because they also work for column names that contain spaces
df['Score_3'].mean()

56.95000000000001
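
The result `56.95000000000001` differs from the expected `56.95` only because of floating-point rounding. One way to get the tidy value, sketched here with the Score_3 values copied into a standalone Series named `score_3`, is to round the result:

```python
import pandas as pd

# Score_3 values from the data frame above
score_3 = pd.Series([67.5, 30.8, 44.2, 96.1, 85.4, 0.1, 95.0, 53.8, 53.4, 43.2])

raw_mean = score_3.mean()
rounded_mean = round(raw_mean, 2)

print(raw_mean)      # may show trailing floating-point digits
print(rounded_mean)  # 56.95
```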

### 8. From the original data frame, calculate the maximum Score_4 value.

Expected output:

````python
            88.8
````

In [24]:
# use the same method described in exercise 7
df['Score_4'].max()

88.8

In [25]:
# Or

df.Score_4.max()

88.8

### 9. From the original data frame, calculate the median Score_2 value.


Expected output:

````python
            40.75
````

In [26]:
# use the same method described in exercise 8
df['Score_2'].median()

40.75

In [27]:
# Or

df.Score_2.median()

40.75
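
Exercises 7 to 9 call one statistic at a time. As a sketch of an alternative, `DataFrame.agg` can compute several statistics for several columns in one call. The `scores` frame below copies only the Score_2 and Score_4 columns.

```python
import pandas as pd

# Score_2 and Score_4 values from the data frame above
scores = pd.DataFrame({
    'Score_2': [95.0, 40.8, 73.2, 0.1, 20.5, 69.0, 40.7, 26.4, 96.4, 39.0],
    'Score_4': [35.0, 37.8, 14.6, 4.2, 22.8, 31.8, 83.8, 88.8, 72.4, 81.6],
})

# One row per statistic, one column per score
summary = scores.agg(['mean', 'median', 'max'])

print(summary)
print(summary.loc['median', 'Score_2'])  # median of Score_2
print(summary.loc['max', 'Score_4'])     # maximum of Score_4
```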

### 10. Create a Pandas DataFrame from the dictionary of product orders below.

Expected output:

|    | Description                       |   Quantity |   UnitPrice |   Revenue |
|---:|:----------------------------------|-----------:|------------:|----------:|
|  0 | LUNCH BAG APPLE DESIGN            |          1 |        1.65 |      1.65 |
|  1 | SET OF 60 VINTAGE LEAF CAKE CASES |         24 |        0.55 |     13.2  |
|  2 | RIBBON REEL STRIPES DESIGN        |          1 |        1.65 |      1.65 |
|  3 | WORLD WAR 2 GLIDERS ASSTD DESIGNS |       2880 |        0.18 |    518.4  |
|  4 | PLAYING CARDS JUBILEE UNION JACK  |          2 |        1.25 |      2.5  |
|  5 | POPCORN HOLDER                    |          7 |        0.85 |      5.95 |
|  6 | BOX OF VINTAGE ALPHABET BLOCKS    |          1 |       11.95 |     11.95 |
|  7 | PARTY BUNTING                     |          4 |        4.95 |     19.8  |
|  8 | JAZZ HEARTS ADDRESS BOOK          |         10 |        0.19 |      1.9  |
|  9 | SET OF 4 SANTA PLACE SETTINGS     |         48 |        1.25 |     60    |

In [28]:
orders = {'Description': ['LUNCH BAG APPLE DESIGN',
  'SET OF 60 VINTAGE LEAF CAKE CASES ',
  'RIBBON REEL STRIPES DESIGN ',
  'WORLD WAR 2 GLIDERS ASSTD DESIGNS',
  'PLAYING CARDS JUBILEE UNION JACK',
  'POPCORN HOLDER',
  'BOX OF VINTAGE ALPHABET BLOCKS',
  'PARTY BUNTING',
  'JAZZ HEARTS ADDRESS BOOK',
  'SET OF 4 SANTA PLACE SETTINGS'],
 'Quantity': [1, 24, 1, 2880, 2, 7, 1, 4, 10, 48],
 'UnitPrice': [1.65, 0.55, 1.65, 0.18, 1.25, 0.85, 11.95, 4.95, 0.19, 1.25],
 'Revenue': [1.65, 13.2, 1.65, 518.4, 2.5, 5.95, 11.95, 19.8, 1.9, 60.0]}

In [29]:
# we can create a DataFrame by passing the dictionary to pd.DataFrame(); each key becomes a column
df2 = pd.DataFrame(orders)

In [30]:
df2.head()

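|    | Description                       |   Quantity |   UnitPrice |   Revenue |
|---:|:----------------------------------|-----------:|------------:|----------:|
|  0 | LUNCH BAG APPLE DESIGN            |          1 |        1.65 |      1.65 |
|  1 | SET OF 60 VINTAGE LEAF CAKE CASES |         24 |        0.55 |     13.2  |
|  2 | RIBBON REEL STRIPES DESIGN        |          1 |        1.65 |      1.65 |
|  3 | WORLD WAR 2 GLIDERS ASSTD DESIGNS |       2880 |        0.18 |    518.4  |
|  4 | PLAYING CARDS JUBILEE UNION JACK  |          2 |        1.25 |      2.5  |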
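
Two details of the `orders` dictionary are easy to miss: some descriptions end with a trailing space, and Revenue appears to be Quantity times UnitPrice. The sketch below checks both on the first three orders; `orders_demo` and `computed` are names chosen for this example, and the Revenue relationship is an observation about this data rather than something the lab states.

```python
import numpy as np
import pandas as pd

# First three orders only; note the trailing spaces in two descriptions
orders_demo = pd.DataFrame({
    'Description': ['LUNCH BAG APPLE DESIGN',
                    'SET OF 60 VINTAGE LEAF CAKE CASES ',
                    'RIBBON REEL STRIPES DESIGN '],
    'Quantity': [1, 24, 1],
    'UnitPrice': [1.65, 0.55, 1.65],
    'Revenue': [1.65, 13.2, 1.65],
})

# Remove leading and trailing whitespace from the text column
orders_demo['Description'] = orders_demo['Description'].str.strip()

# Compare Quantity * UnitPrice with Revenue, allowing for floating-point error
computed = orders_demo['Quantity'] * orders_demo['UnitPrice']
print(np.allclose(computed, orders_demo['Revenue']))  # True

print(orders_demo.dtypes)
```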


### 11. Calculate the total quantity ordered and revenue generated from these orders.

Expected output:

````python
            Quantity    2978.0
            Revenue      637.0
````

In [31]:
# select the variable 'Quantity' from df2 and call sum() function
df2.Quantity.sum()

2978

In [32]:
# select the variable 'Revenue' from df2 and call sum() function
df2.Revenue.sum()

637.0

In [33]:
# we can also subset the dataframe, selecting the variables 'Quantity' and 'Revenue' at the same time
df2[['Quantity', 'Revenue']].sum()

Quantity    2978.0
Revenue      637.0
dtype: float64
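
Note that the combined result shows `2978.0` while the single-column sum shows `2978`. A Series holds one dtype, so when an integer total and a float total are returned together, the integer is upcast to `float64`. A small sketch with a three-row excerpt named `totals_demo`:

```python
import pandas as pd

totals_demo = pd.DataFrame({
    'Quantity': [1, 24, 1],           # integer column
    'Revenue': [1.65, 13.2, 1.65],    # float column
})

# Summing both columns at once returns a single float64 Series
totals = totals_demo[['Quantity', 'Revenue']].sum()
print(totals)
print(totals.dtype)  # float64

# Summing the integer column on its own keeps an integer result
quantity_total = totals_demo['Quantity'].sum()
print(quantity_total)  # 26
```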

### 12. Obtain the prices of the most expensive and least expensive items ordered and print the difference.

Expected output:

````python
            Most expensive:  11.95
            Least expensive:  0.18
            Difference:  11.77
````

In [34]:
# select the variable 'UnitPrice' from df2 and call max() function
expensive = df2.UnitPrice.max()

In [35]:
# select the variable 'UnitPrice' from df2 and call min() function
cheap = df2.UnitPrice.min()

In [36]:
print(expensive - cheap)

11.77


In [37]:
print('Most expensive: ', expensive)
print('Least expensive: ', cheap)
print('Difference: ', expensive - cheap)

Most expensive:  11.95
Least expensive:  0.18
Difference:  11.77
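
Subtracting floats can leave trailing digits, so formatting the numbers to two decimals is a safer way to print prices. A minimal sketch using f-strings, with the UnitPrice values copied into a standalone Series named `unit_price`:

```python
import pandas as pd

# UnitPrice values from the orders above
unit_price = pd.Series([1.65, 0.55, 1.65, 0.18, 1.25, 0.85, 11.95, 4.95, 0.19, 1.25])

expensive = unit_price.max()
cheap = unit_price.min()
difference = expensive - cheap

# :.2f formats each number with exactly two decimal places
report = (
    f"Most expensive: {expensive:.2f}\n"
    f"Least expensive: {cheap:.2f}\n"
    f"Difference: {difference:.2f}"
)
print(report)
```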


# BONUS

### 1 Create a random dataframe containing 3 columns and 10 rows.

*Hint: You'll have to use a `numpy.random` method to create random observations.*

**Example of output:**

|    |         0 |        1 |        2 |
|---:|----------:|---------:|---------:|
|  0 | 0.0883504 | 0.867054 | 0.977949 |
|  1 | 0.819555  | 0.643927 | 0.122975 |
|  2 | 0.484618  | 0.515125 | 0.807791 |
|  3 | 0.539583  | 0.620799 | 0.926963 |
|  4 | 0.0479896 | 0.772282 | 0.621921 |
|  5 | 0.1398    | 0.767678 | 0.678816 |
|  6 | 0.0895157 | 0.242867 | 0.887941 |
|  7 | 0.160186  | 0.750576 | 0.439416 |
|  8 | 0.982747  | 0.652903 | 0.575518 |
|  9 | 0.646373  | 0.523237 | 0.700222 |

In [38]:
# the np.random module has several functions that return random arrays; explore the documentation a little
# the one used here, np.random.rand, takes the dimensions of the array and returns float64 values drawn uniformly from [0, 1)
df = pd.DataFrame(np.random.rand(10,3))

In [39]:
df

|    |        0 |        1 |        2 |
|---:|---------:|---------:|---------:|
|  0 | 0.83402  | 0.154581 | 0.402568 |
|  1 | 0.092287 | 0.960247 | 0.080577 |
|  2 | 0.844367 | 0.818094 | 0.115054 |
|  3 | 0.027981 | 0.516355 | 0.435364 |
|  4 | 0.014334 | 0.881706 | 0.783343 |
|  5 | 0.007977 | 0.273311 | 0.632282 |
|  6 | 0.273395 | 0.513848 | 0.453065 |
|  7 | 0.428426 | 0.734763 | 0.250367 |
|  8 | 0.227289 | 0.231936 | 0.589023 |
|  9 | 0.808542 | 0.153248 | 0.600863 |
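
Because the values are random, the output above will differ on every run. If reproducible results are wanted, one option is NumPy's `Generator` API with a fixed seed, sketched below. The seed value 42 is arbitrary, and `random_df` and `repeat_df` are names chosen for this example.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)
random_df = pd.DataFrame(rng.random((10, 3)))  # 10 rows, 3 columns, values in [0, 1)

# A second generator with the same seed reproduces the same data
repeat_df = pd.DataFrame(np.random.default_rng(seed=42).random((10, 3)))

print(random_df.shape)              # (10, 3)
print(random_df.equals(repeat_df))  # True
```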


#### 2 Create a random dataframe containing 3 columns and 10 rows with specific column names. 

Use `['column1', 'column2', 'column3']` as the column names.

**Example of output:**

|    |   column1 |   column2 |   column3 |
|---:|----------:|----------:|----------:|
|  0 |        10 |        21 |        45 |
|  1 |        32 |        87 |        39 |
|  2 |        93 |        53 |         8 |
|  3 |        52 |        77 |        41 |
|  4 |        72 |        50 |        31 |
|  5 |        16 |         6 |        41 |
|  6 |        66 |        33 |        62 |
|  7 |        97 |        50 |        92 |
|  8 |        21 |        71 |        42 |
|  9 |        91 |        97 |        32 |

In [40]:
# np.random.randint takes the range of integers to draw from (low is inclusive, high is exclusive) and the shape of the array
array_size=(10,3)
df = pd.DataFrame(np.random.randint(0, 100, size=array_size),columns=['column1','column2', 'column3'])

In [41]:
df

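|    |   column1 |   column2 |   column3 |
|---:|----------:|----------:|----------:|
|  0 |        84 |        22 |        97 |
|  1 |        50 |        18 |        73 |
|  2 |         1 |         5 |        39 |
|  3 |        96 |        15 |        49 |
|  4 |        80 |        81 |        26 |
|  5 |        60 |        88 |        60 |
|  6 |        33 |        38 |        22 |
|  7 |        85 |        21 |        84 |
|  8 |        69 |        64 |        76 |
|  9 |         8 |        69 |        13 |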
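
One detail worth checking is the range of `np.random.randint(0, 100, ...)`: the lower bound is included and the upper bound is excluded, so values fall between 0 and 99. A quick sketch that verifies this on a freshly generated frame named `int_df`:

```python
import numpy as np
import pandas as pd

# low=0 is inclusive, high=100 is exclusive
values = np.random.randint(0, 100, size=(10, 3))
int_df = pd.DataFrame(values, columns=['column1', 'column2', 'column3'])

smallest = int_df.min().min()  # minimum over all columns
largest = int_df.max().max()   # maximum over all columns

print(smallest >= 0)   # True
print(largest <= 99)   # True
```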


#### 3 Create a random dataframe containing N columns and M rows with specific column names.

For this task you'll have to name the columns as `['column1', 'column2', 'column3', ...,'columnN']`. You can use a list comprehension to do that.

**Example of output:**

|    |   column1 |   column2 | ... |   column9 |   column10 |
|---:|----------:|----------:|--------:|----------:|-----------:|
|  0 |        26 |        37 |      ... |        15 |         40 |
|  1 |         8 |        75 |      ... |        86 |         76 |
|  2 |        60 |        24 |      ... |        60 |         58 |
|  3 |        35 |        38 |      ... |        65 |         30 |
|  ... |        ... |        ... |      ... |        ... |         ... |
| 26 |        87 |        26 |      ... |        77 |         98 |
| 27 |        97 |        94 |      ... |        37 |         36 |
| 28 |        97 |        74 |      ... |         8 |         71 |
| 29 |        63 |        17 |      ... |        89 |         69 |

In [42]:
N = 10
M = 30
df = pd.DataFrame(np.random.randint(0, 100, size=(M,N)),columns=["column"+str(i+1) for i in range(N)])
df

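|    |   column1 |   column2 |   column3 |   column4 |   column5 |   column6 |   column7 |   column8 |   column9 |   column10 |
|---:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|----------:|-----------:|
|  0 |        72 |        32 |        94 |        40 |        57 |        68 |        39 |        13 |        11 |         14 |
|  1 |        41 |        61 |        21 |        76 |        55 |        24 |        70 |        23 |        33 |         38 |
|  2 |        31 |        56 |        18 |        59 |        58 |        44 |         0 |        25 |        11 |         54 |
|  3 |        29 |        38 |         9 |        71 |        34 |        88 |        27 |        34 |        24 |         86 |
|  4 |         6 |        39 |        77 |        59 |        14 |        27 |        62 |        43 |        60 |         93 |
|  5 |        46 |        92 |        87 |        90 |        74 |        17 |        23 |        87 |        86 |         92 |
|  6 |        58 |         2 |         8 |        43 |        59 |        75 |        25 |        65 |        41 |         32 |
|  7 |         7 |        29 |        77 |        51 |        74 |        87 |         3 |        29 |        16 |         87 |
|  8 |        24 |        93 |        37 |        37 |        94 |        29 |        43 |        45 |        93 |         15 |
|  9 |        49 |        62 |        97 |        27 |        18 |        42 |        79 |         7 |        10 |         16 |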
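
The one-liner above can also be wrapped in a small reusable function. The sketch below is one possible version: `make_random_df` and its parameters are hypothetical names introduced for this example, and it uses `Generator.integers` instead of `np.random.randint` so that a seed can be passed.

```python
import numpy as np
import pandas as pd


def make_random_df(n_rows, n_cols, low=0, high=100, seed=None):
    """Return a DataFrame of random integers in [low, high) with columns column1..columnN."""
    rng = np.random.default_rng(seed)
    # List comprehension builds 'column1', 'column2', ..., 'columnN'
    columns = [f"column{i}" for i in range(1, n_cols + 1)]
    data = rng.integers(low, high, size=(n_rows, n_cols))
    return pd.DataFrame(data, columns=columns)


demo = make_random_df(30, 10, seed=0)
print(demo.shape)              # (30, 10)
print(list(demo.columns[:3]))  # ['column1', 'column2', 'column3']
```

Passing the same `seed` twice gives the same data frame, which can help when comparing results with classmates.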
