# Working with a `pandas.DataFrame`

These exercises and drills aim to build your experience of using a `pandas.DataFrame`. Try to work through them all without looking at the answers first.

<div class="alert alert-block alert-info"><b>Tip:</b> Undoubtedly, in practice you will work with more complex examples and data than those found here. These exercises are a very gentle introduction to the syntax of `pandas`. But don't worry, we have some more complicated data wrangling examples coming up!</div>

---

**Remember that any code that uses `pandas` must import it first. We will also import `numpy` to help us generate some synthetic data.**

In [2]:
import pandas as pd
import numpy as np

## Exercise 1

**Task:**
* Create a `pandas.Series` named "number" with numbers from 1 to 10,000 inclusive.  
* The datatype of the series should be `np.uint32`     
* Check the length of the `Series`
* View the head and tail of the series to quickly validate your code has worked. 
    

In [None]:
# your code here ...

In [14]:
# example solution
column1 = pd.Series(np.arange(1, 10_000+1), name='number', 
                    dtype=np.uint32)
column1.head(2)

0    1
1    2
Name: number, dtype: uint32

In [15]:
column1.tail(2)

9998     9999
9999    10000
Name: number, dtype: uint32

In [10]:
len(column1)

10000
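Viewing the head and tail is a quick visual check. As a minimal sketch, the same properties can also be checked programmatically with assertions; it rebuilds the series from the example solution above, so nothing new is assumed beyond that code.

```python
import numpy as np
import pandas as pd

# Rebuild the series from the example solution.
column1 = pd.Series(np.arange(1, 10_000 + 1), name="number", dtype=np.uint32)

# Each assertion mirrors one bullet point of the task.
assert column1.name == "number"
assert column1.dtype == np.uint32
assert len(column1) == 10_000
assert column1.iloc[0] == 1 and column1.iloc[-1] == 10_000
```

If any requirement were not met, the matching `assert` would raise an `AssertionError`.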

## Exercise 2

**Task**:
* Create a `pandas.DataFrame` with 5 rows and 5 columns.  
* The data contained in each column should all be of type int64.
* Columns should be titled "col_1", "col_2" ... "col_5"
* Check the datatype, shape and column names using `.info()`
* View the `DataFrame`
* View only the first 2 rows in the `DataFrame`
* Create a new variable `col_4` of type `pandas.Series` that contains only the data in `col_4`

**Hints**:
* The data can take any valid int64 value.
* One option is to generate a random matrix using a `numpy.random.Generator`
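If you have not used a `numpy.random.Generator` before, the sketch below shows the general pattern on a deliberately small 2 x 3 matrix, so it does not give the exercise away. The seed, the bounds, and the variable name `small` are arbitrary choices for illustration.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

# 2 x 3 matrix of integers drawn from [0, 10); `high` is exclusive by default.
small = rng.integers(low=0, high=10, size=(2, 3), dtype=np.int64)

print(small.shape)  # (2, 3)
print(small.dtype)  # int64
```

Passing `dtype=np.int64` explicitly is optional here, since it is the default for `Generator.integers`, but it makes the intended datatype visible in the code.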

In [None]:
# your code here...

In [27]:
# example solution - data could be anything you want.
rng = np.random.default_rng(42)
matrix = rng.integers(0, 500_000, size=(5, 5))
df = pd.DataFrame(matrix, columns=[f'col_{i}' for i in range(1, 6)])

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col_1   5 non-null      int64
 1   col_2   5 non-null      int64
 2   col_3   5 non-null      int64
 3   col_4   5 non-null      int64
 4   col_5   5 non-null      int64
dtypes: int64(5)
memory usage: 328.0 bytes


In [31]:
df.shape

(5, 5)
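`.info()` prints its summary rather than returning values, so it is mainly useful for reading by eye. As a hedged sketch, the same facts can be read from attributes when you need them in code; the `DataFrame` is rebuilt here in the same way as the example solution.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame (same seed and bounds as the example solution).
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.integers(0, 500_000, size=(5, 5)),
                  columns=[f"col_{i}" for i in range(1, 6)])

print(df.shape)                       # (5, 5)
print(list(df.columns))               # ['col_1', 'col_2', 'col_3', 'col_4', 'col_5']
print((df.dtypes == np.int64).all())  # True
```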

In [30]:
df

    col_1   col_2   col_3   col_4   col_5
0   44625  386978  327285  219439  216507
1  429298   42972  348684  100734   47088
2  263239  487811  367876  380569  358738
3  393032  256613   64056  419874  225192
4  250175  185399   91274  463382  390783


In [29]:
df.head(2)

    col_1   col_2   col_3   col_4   col_5
0   44625  386978  327285  219439  216507
1  429298   42972  348684  100734   47088


In [33]:
col_4 = df['col_4']
print(type(col_4))
print(col_4)

<class 'pandas.core.series.Series'>
0    219439
1    100734
2    380569
3    419874
4    463382
Name: col_4, dtype: int64
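Selecting with a single column label returns a `Series`, as shown above. A related point that often trips people up is that selecting with a list of labels returns a `DataFrame` instead, even when the list holds only one name. The sketch below illustrates the difference; the variable names `as_series` and `as_frame` are made up for this example.

```python
import numpy as np
import pandas as pd

# Rebuild the example DataFrame (same seed and bounds as the example solution).
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.integers(0, 500_000, size=(5, 5)),
                  columns=[f"col_{i}" for i in range(1, 6)])

as_series = df["col_4"]    # single label -> pandas.Series
as_frame = df[["col_4"]]   # list of labels -> pandas.DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (5, 1)
```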


## Exercise 3

