In [1]:
%reset

# Use the Python Pandas library for data manipulation and analysis.

We follow the convention of importing Pandas under the pd alias.


In [3]:
import numpy as np
import pandas as pd

With Pandas you can use read_csv(path_to_file) to automatically read and parse your CSV.
The only required argument is the path to the CSV file as a string or path object. The file can
be stored locally on your computer or hosted online. Documentation for this function can be found
at https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.read_csv.html.  

In this example we'll use the board thickness dataset.

There are many ways to read the dataset. Use pd.read_csv() to read in the dataset and store
it as a DataFrame object in a variable of your choice, here all_boards.

# Reading Data

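**Read from os**: reading from the operating system

The os module can be used when the data is in the same folder as the Python file, provided
that folder is also the current working directory. os.getcwd() returns the absolute path of
the working directory and os.sep is the path separator for your operating system.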

In [4]:
import os
all_boards1 = pd.read_csv(os.getcwd() + os.sep + "six-point-board-thickness.csv")

In [82]:
all_boards1

Unnamed: 0,Date.Time,Pos1,Pos2,Pos3,Pos4,Pos5,Pos6
0,2010-02-18 3:04,1761,1739,1758,1677,1684,1692
1,2010-02-18 3:37,1801,1688,1753,1741,1692,1675
2,2010-02-18 3:37,1697,1682,1663,1671,1685,1651
3,2010-02-18 3:37,1679,1712,1672,1703,1683,1674
4,2010-02-18 3:37,1699,1688,1699,1678,1688,1705
...,...,...,...,...,...,...,...
4995,2010-02-18 13:15,1690,1701,1690,1694,1735,1695
4996,2010-02-18 13:15,1703,1674,1666,1694,1659,1728
4997,2010-02-18 13:16,1657,1667,1675,1654,1648,1609
4998,2010-02-18 13:16,1746,1717,1638,1723,1703,1706
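
As an alternative to joining strings with os.sep, the same path can be built with the standard-library pathlib module. This is a minimal sketch that assumes the CSV sits in the current working directory; the existence check is added here only so the cell does not fail when the file is missing.

```python
from pathlib import Path

import pandas as pd

# Build the path from the current working directory.
# The "/" operator inserts the correct separator for the operating system.
csv_path = Path.cwd() / "six-point-board-thickness.csv"

# Only attempt the read if the file is actually there.
if csv_path.exists():
    all_boards1 = pd.read_csv(csv_path)
    print(all_boards1.shape)
else:
    print(f"File not found: {csv_path}")
```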


**Read from open**

Copy the dataset into your working directory and read it using just the file name.

In [6]:
all_boards = pd.read_csv('six-point-board-thickness.csv')

**Read from online URL**

You can download the file from a URL with a request, which will save the .csv
to your working directory. Alternatively, you can pass the URL directly to read_csv.
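
One possible way to do this is sketched below using urllib.request from the standard library. The helper name download_and_read and the example.com URL are hypothetical placeholders, not part of the original notebook, so the calls are left commented out.

```python
import urllib.request

import pandas as pd


def download_and_read(url, filename):
    """Save the CSV found at `url` as `filename`, then load it with pandas."""
    urllib.request.urlretrieve(url, filename)
    return pd.read_csv(filename)


# Hypothetical URL: replace it with the real location of the dataset.
# all_boards = download_and_read(
#     "https://example.com/six-point-board-thickness.csv",
#     "six-point-board-thickness.csv",
# )

# read_csv also accepts a URL directly, without saving a local copy:
# all_boards = pd.read_csv("https://example.com/six-point-board-thickness.csv")
```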

# Explore the data that you have read into Python.  

Since you read it using pandas, it will be a dataframe. 

# Dataframes

The dataframe provides convenient built-in ways to query the dataset, manipulate the data, and analyze the data.  Like the dataset, the dataframe in this case has 5000 rows and 7 columns.    
The type() function is used to get the type of the object and check that it is a Pandas DataFrame.  
The len() function will show the number of rows.  
There are 3 attributes that describe the size of the dataframe:  
> The .shape attribute will show the dimensionality.  The result is a tuple containing the number of rows and columns.  
The .ndim attribute will show the number of dimensions of the dataframe.  
The .size attribute will show the total number of values.  

There are 3 components of the dataframe: the column names, the row labels, and the values. This arrangement is what makes a data matrix tidy, so first arrange, or tidy, your data into the form that you want.  
> The column names can be found with the .columns attribute.    
The .index attribute returns the row labels.  
The .values attribute returns the dataframe values.  You can also use the .to_numpy() method to create a 2D array of the values. 

Take a look at the first 5 rows of the dataframe with .head()

In [7]:
type(all_boards)
#len(all_boards)
#all_boards.shape
#all_boards.ndim
#all_boards.size
#all_boards.columns
#all_boards.index
#all_boards.to_numpy()
#all_boards.head()

pandas.core.frame.DataFrame
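
To see what each of these calls returns without uncommenting them one at a time, here is a small sketch on a five-row sample. The sample dataframe is an assumption for illustration; its values are copied from the first rows of the board dataset shown above.

```python
import pandas as pd

# Small stand-in for all_boards (first five rows of three position columns).
sample = pd.DataFrame(
    {
        "Pos1": [1761, 1801, 1697, 1679, 1699],
        "Pos2": [1739, 1688, 1682, 1712, 1688],
        "Pos3": [1758, 1753, 1663, 1672, 1699],
    }
)

print(type(sample))             # the DataFrame class
print(len(sample))              # number of rows: 5
print(sample.shape)             # (rows, columns): (5, 3)
print(sample.ndim)              # number of dimensions: 2
print(sample.size)              # total number of values: 15
print(list(sample.columns))     # column names
print(sample.index)             # row labels
print(sample.to_numpy().shape)  # 2D NumPy array with the same shape
```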

# Get to know the dataframe. 
You have imported a CSV file and had a first look at the data.  Now let's learn to examine the data systematically.  

First, take a look at the different data types that the dataframe contains.  Each column of the dataframe has a specific data type.  Remember that a column of a dataframe is a Series object.  You can display all columns with their data types using .info().  

Or use the .dtypes attribute to return a Series object with column names as labels and the corresponding data types as values.  

Pandas uses the NumPy library to work with these data types.

In [9]:
all_boards.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Date.Time  5000 non-null   object
 1   Pos1       5000 non-null   int64 
 2   Pos2       5000 non-null   int64 
 3   Pos3       5000 non-null   int64 
 4   Pos4       5000 non-null   int64 
 5   Pos5       5000 non-null   int64 
 6   Pos6       5000 non-null   int64 
dtypes: int64(6), object(1)
memory usage: 273.6+ KB
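
The .dtypes attribute mentioned above can be tried on a small sample. This is a sketch under the assumption of a two-row stand-in dataframe with values copied from the dataset; how the text column is reported may depend on the pandas version.

```python
import pandas as pd

# Two-row stand-in with one text column and one integer column.
sample = pd.DataFrame(
    {
        "Date.Time": ["2010-02-18 3:04", "2010-02-18 3:37"],
        "Pos1": [1761, 1801],
    }
)

# .dtypes is a Series: column names are the labels, data types are the values.
column_types = sample.dtypes
print(column_types)

# A single column's type can be looked up by its name.
print(column_types["Pos1"])
```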


## Accessing elements and manipulating data

In this case, we're only interested in the board positions. The time the measurements were taken doesn't matter to us, so we can drop that column (Date.Time) from the dataframe using the [.drop() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html). Note that these data manipulation methods return another dataframe object, so the same methods are applicable to the transformed data as well.

In [86]:
boards_no_time = all_boards.drop(columns=["Date.Time"])
print(boards_no_time)

      Pos1  Pos2  Pos3  Pos4  Pos5  Pos6
0     1761  1739  1758  1677  1684  1692
1     1801  1688  1753  1741  1692  1675
2     1697  1682  1663  1671  1685  1651
3     1679  1712  1672  1703  1683  1674
4     1699  1688  1699  1678  1688  1705
...    ...   ...   ...   ...   ...   ...
4995  1690  1701  1690  1694  1735  1695
4996  1703  1674  1666  1694  1659  1728
4997  1657  1667  1675  1654  1648  1609
4998  1746  1717  1638  1723  1703  1706
4999  1668  1680  1668  1669  1651  1629

[5000 rows x 6 columns]


Columns can be accessed by using their name in square brackets \[\], while rows can be accessed using their row position (an integer) and the [.iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) property. Use .loc for label-based indexing.

iloc works similarly to indexing a list, so the same type of slicing can be used: \[start:end\].

In [87]:
pos1 = boards_no_time['Pos1']
print(pos1)

0       1761
1       1801
2       1697
3       1679
4       1699
        ... 
4995    1690
4996    1703
4997    1657
4998    1746
4999    1668
Name: Pos1, Length: 5000, dtype: int64


In [88]:
# Get the first three rows of the dataframe
first_three_rows = boards_no_time.iloc[0:3]
print(first_three_rows)

   Pos1  Pos2  Pos3  Pos4  Pos5  Pos6
0  1761  1739  1758  1677  1684  1692
1  1801  1688  1753  1741  1692  1675
2  1697  1682  1663  1671  1685  1651
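
One difference between the two properties is worth seeing side by side: an .iloc slice excludes its end position, like a list, while a .loc slice includes its end label. A minimal sketch on a five-row sample (assumed here for illustration, with values copied from the dataset):

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Pos1": [1761, 1801, 1697, 1679, 1699],
        "Pos2": [1739, 1688, 1682, 1712, 1688],
    }
)

# Position-based: rows at positions 0, 1 and 2 (the end, 3, is excluded).
by_position = sample.iloc[0:3]

# Label-based: rows labelled 0 through 3 (the end label is included).
by_label = sample.loc[0:3]

print(len(by_position))  # 3
print(len(by_label))     # 4
```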


You can use the .head() or .tail() methods to get a given number of rows from the top (head) or bottom (tail) of the dataframe. Syntactically they're the same, so only one of them is shown here.

In [89]:
# Get the first three rows of the dataframe using head
first_three_rows = boards_no_time.head(3)
print(first_three_rows)

   Pos1  Pos2  Pos3  Pos4  Pos5  Pos6
0  1761  1739  1758  1677  1684  1692
1  1801  1688  1753  1741  1692  1675
2  1697  1682  1663  1671  1685  1651


Row and column access can also be combined. The head and tail methods would work here as well.

In [90]:
first_column_three_rows = boards_no_time['Pos1'].iloc[0:3]
print(first_column_three_rows)

0    1761
1    1801
2    1697
Name: Pos1, dtype: int64
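
The same selection can also be written as a single indexing call, passing the rows first and the column second. This is a sketch on an assumed five-row sample with values copied from the dataset, not a replacement for the chained form above.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Pos1": [1761, 1801, 1697, 1679, 1699],
        "Pos2": [1739, 1688, 1682, 1712, 1688],
    }
)

# Chained form: pick the column, then slice the rows by position.
chained = sample["Pos1"].iloc[0:3]

# Single call by position: rows 0 to 2, column at position 0.
by_position = sample.iloc[0:3, 0]

# Single call by label: row labels 0 to 2 (inclusive), column "Pos1".
by_label = sample.loc[0:2, "Pos1"]

print(by_position.tolist())  # [1761, 1801, 1697]
```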


It is also possible to filter the dataframe based on the values in certain columns. Passing a condition to the [.loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) property returns a new dataframe with just the rows that meet the condition. 

In [91]:
# Get all columns (the whole dataframe) for rows which have pos1 < 1650
less_1650 = boards_no_time.loc[boards_no_time['Pos1'] < 1650]
print(less_1650)

      Pos1  Pos2  Pos3  Pos4  Pos5  Pos6
10    1546  1697  1654  1688  1668  1703
11    1524  1668  1594  1686  1741  1730
23    1608  1664  1641  1651  1633  1594
48    1636  1650  1649  1666  1673  1665
55    1643  1649  1624  1667  1662  1660
...    ...   ...   ...   ...   ...   ...
4932  1632  1608  1634  1667  1674  1706
4933  1619  1692  1696  1711  1742  1728
4961  1622  1627  1641  1664  1663  1660
4969  1645  1680  1658  1747  1793  1808
4974  1648  1612  1620  1637  1611  1694

[539 rows x 6 columns]


In [92]:
# Get only pos1 values for rows with pos1 < 1650
pos1_less_1650 = boards_no_time.loc[boards_no_time['Pos1'] < 1650, ['Pos1']]
print(pos1_less_1650)

      Pos1
10    1546
11    1524
23    1608
48    1636
55    1643
...    ...
4932  1632
4933  1619
4961  1622
4969  1645
4974  1648

[539 rows x 1 columns]


Separate conditionals can be combined using boolean logic. In this case each conditional needs to be written in round brackets, and the element-wise symbols below replace Python's and, or, and not keywords:
* and: &
* or: |
* not: ~ 

In [93]:
# Get all rows that have 1600 < pos1 < 1650
between_1600_1650 = boards_no_time.loc[(boards_no_time['Pos1'] > 1600) & (boards_no_time['Pos1'] < 1650)]
print(between_1600_1650)

      Pos1  Pos2  Pos3  Pos4  Pos5  Pos6
23    1608  1664  1641  1651  1633  1594
48    1636  1650  1649  1666  1673  1665
55    1643  1649  1624  1667  1662  1660
84    1618  1603  1607  1652  1666  1657
86    1645  1685  1694  1705  1644  1542
...    ...   ...   ...   ...   ...   ...
4932  1632  1608  1634  1667  1674  1706
4933  1619  1692  1696  1711  1742  1728
4961  1622  1627  1641  1664  1663  1660
4969  1645  1680  1658  1747  1793  1808
4974  1648  1612  1620  1637  1611  1694

[465 rows x 6 columns]
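
The cell above only uses &. The other two symbols work the same way, as this sketch on an assumed five-row sample (values copied from the dataset) illustrates; the thresholds are arbitrary choices for the example.

```python
import pandas as pd

sample = pd.DataFrame({"Pos1": [1761, 1801, 1697, 1679, 1699]})

# or: rows where Pos1 is below 1690 OR above 1750
outside = sample.loc[(sample["Pos1"] < 1690) | (sample["Pos1"] > 1750)]

# not: rows where Pos1 is NOT below 1700
not_low = sample.loc[~(sample["Pos1"] < 1700)]

print(outside["Pos1"].tolist())  # [1761, 1801, 1679]
print(not_low["Pos1"].tolist())  # [1761, 1801]
```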


Alternately, the .query() method can be used to succinctly query the dataframe.
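
A minimal sketch of .query() is shown here, assuming a small sample whose Pos1 values are copied from the outputs above. The condition is written as a string, column names are used directly, and a Python variable can be referenced with the @ prefix; the variable name threshold is chosen only for this example.

```python
import pandas as pd

sample = pd.DataFrame({"Pos1": [1761, 1546, 1608, 1636, 1801]})

# Same filter as the .loc version: 1600 < Pos1 < 1650
with_query = sample.query("Pos1 > 1600 and Pos1 < 1650")
with_loc = sample.loc[(sample["Pos1"] > 1600) & (sample["Pos1"] < 1650)]

# Python variables are referenced inside the string with @
threshold = 1650
below = sample.query("Pos1 < @threshold")

print(with_query["Pos1"].tolist())  # [1608, 1636]
print(below["Pos1"].tolist())       # [1546, 1608, 1636]
```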

# With a DataFrame you can use NumPy and Pandas to find statistical summary data

In [95]:
boards_no_time

Unnamed: 0,Pos1,Pos2,Pos3,Pos4,Pos5,Pos6
0,1761,1739,1758,1677,1684,1692
1,1801,1688,1753,1741,1692,1675
2,1697,1682,1663,1671,1685,1651
3,1679,1712,1672,1703,1683,1674
4,1699,1688,1699,1678,1688,1705
...,...,...,...,...,...,...
4995,1690,1701,1690,1694,1735,1695
4996,1703,1674,1666,1694,1659,1728
4997,1657,1667,1675,1654,1648,1609
4998,1746,1717,1638,1723,1703,1706


You can use NumPy for a single vector of data (array data), such as one column.

In [101]:
np.mean(boards_no_time["Pos1"])

1689.3934

In NumPy, the default standard deviation is the population standard deviation. If you want the sample standard deviation, the sum of squared deviations from the mean must be divided by (n-1) instead of n.  In NumPy you do this by setting ddof=1.  

In [103]:
np.std(boards_no_time["Pos1"],ddof=1)

43.8385241503399
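
To make the role of ddof concrete, the sketch below computes both versions by hand and compares them with NumPy and pandas. The five values are the first five Pos1 readings from the dataset, used here only as a small example.

```python
import numpy as np
import pandas as pd

pos1 = np.array([1761, 1801, 1697, 1679, 1699])
n = len(pos1)

# Sum of squared deviations from the mean
squared_deviations = np.sum((pos1 - pos1.mean()) ** 2)

# Population standard deviation: divide by n (NumPy default, ddof=0)
population_std = np.sqrt(squared_deviations / n)

# Sample standard deviation: divide by n - 1 (ddof=1)
sample_std = np.sqrt(squared_deviations / (n - 1))

print(np.isclose(population_std, np.std(pos1)))      # NumPy default
print(np.isclose(sample_std, np.std(pos1, ddof=1)))  # NumPy with ddof=1
print(np.isclose(sample_std, pd.Series(pos1).std())) # pandas default
```

All three comparisons should print True, which shows that pandas .std() matches np.std only when ddof=1 is passed to NumPy.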

You can use pandas for dataframe data.  In pandas the default is the sample statistic, so .std() already uses ddof=1. 

In [100]:
boards_no_time["Pos1"].mean()

1689.3934

In [104]:
boards_no_time.describe()

Unnamed: 0,Pos1,Pos2,Pos3,Pos4,Pos5,Pos6
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,1689.3934,1680.9126,1678.2108,1687.351,1682.8952,1681.5778
std,43.838524,41.367021,47.637345,42.425716,40.115778,45.071714
min,880.0,1333.0,1268.0,1252.0,1311.0,1282.0
25%,1670.0,1662.0,1657.75,1666.0,1663.0,1659.0
50%,1685.0,1677.0,1677.0,1683.0,1679.0,1679.0
75%,1705.0,1695.0,1697.25,1708.0,1697.25,1702.0
max,1902.0,1838.0,1840.0,1852.0,1862.0,1865.0
