In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# check which versions of the packages are installed
print("NumPy version",np.__version__, "pandas version ",pd.__version__, "seaborn version",sns.__version__  )

# set print options: floating point precision of 4, summarise arrays with more than 5 elements, suppress scientific notation for small values
np.set_printoptions(precision=4, threshold=5, suppress=True)
pd.options.display.max_rows = 8  # display at most 8 rows of a DataFrame


NumPy version 1.24.3 pandas version  2.0.3 seaborn version 0.12.2


In [4]:
csv_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv'
df =  pd.read_csv(csv_url)

# Check the DataFrame looks ok
Having read the csv file into a pandas DataFrame object, the pandas head and tail functions can be used to confirm that the file has been read in correctly before exploring the DataFrame further below. As it is a very small file, it can be quickly checked against the csv source. tail() is particularly useful for this because problems with reading a csv file usually show up towards the end of the DataFrame, throwing out the last few rows. All looks well here.

In [5]:
print('the first rows in the dataset are as follows', df.head(5))
print('the final rows in the dataset are as follows', df.tail(5))

the first rows in the dataset are as follows    total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
the final rows in the dataset are as follows      total_bill   tip     sex smoker   day    time  size
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2


# Tidy data principles
The tips dataset illustrates the "tidy" approach to organising a dataset. The tips csv dataset has been imported into a pandas DataFrame object. Each column contains one variable and there are 244 rows in the DataFrame, one row for each of the 244 observations.

Again referring to Howard Seltman's book, data from an experiment are generally collected into a rectangular array, most commonly with one row per experimental subject and one column for each subject identifier, outcome variable and explanatory variable. The Tips dataset follows this principle. Each column holds either the numeric values of a quantitative variable or the levels of a categorical variable.

# What does the dataset look like?
Once loaded, a dataset can be explored using the pandas and seaborn packages, which work well together for analysing datasets such as this one. Pandas has many useful functions for slicing and dicing the data and can easily generate statistics such as the five number summary promoted by Tukey. Pandas can also be used to plot the data, but this is where the seaborn package shines.
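As a minimal sketch of the five number summary mentioned above, the example below uses only the five total_bill values shown by head() earlier, not the full dataset. The variable names are illustrative. Note that pandas computes quartiles by interpolation, which can differ slightly from Tukey's hinges on other data.

```python
import pandas as pd

# The five total_bill values displayed by df.head(5) above (a small sample, not the full column)
total_bill = pd.Series([16.99, 10.34, 21.01, 23.68, 24.59], name="total_bill")

# Minimum, lower quartile, median, upper quartile and maximum
five_number = total_bill.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_number)

# describe() reports the same five values along with the count, mean and standard deviation
summary = total_bill.describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```

On the full DataFrame the same calls would be applied to df['total_bill'].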

# Column and row names:
When the 'tips' csv dataset was read in, the column names were taken from the first line of the csv file, which is the default behaviour of pandas.read_csv() when neither a header row nor column names are specified. Different column names can be supplied as a list through the names argument, for example names=['col-name1', 'col-name2']. If the file has no header row, set header=None alongside names. If it does have one, as the tips file does, pass header=0 so that the existing names are replaced rather than read in as a row of data.
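A minimal sketch of how the header and names arguments interact is shown below. It reads a small in-memory csv (a three-column stand-in built from the first two rows of the tips data) rather than the real file, and the replacement column names are made up for illustration.

```python
import io
import pandas as pd

# Small stand-in for the tips csv: one header line followed by two data rows
csv_text = "total_bill,tip,size\n16.99,1.01,2\n10.34,1.66,3\n"
new_names = ["bill", "gratuity", "party_size"]  # hypothetical replacement names

# header=0 marks the first line as the existing header, which names= then replaces
renamed = pd.read_csv(io.StringIO(csv_text), header=0, names=new_names)
print(renamed.columns.tolist(), renamed.shape)

# header=None treats the first line as data, so the old names end up as an extra row
no_header = pd.read_csv(io.StringIO(csv_text), header=None, names=new_names)
print(no_header.shape)
```

Comparing the two shapes shows why header=None is only appropriate for files that have no header line.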

In [6]:
print('the index of the tips dataset', df.index)

the index of the tips dataset RangeIndex(start=0, stop=244, step=1)


There are 7 columns as expected and an index that begins at 0 for the first row. If the index of a DataFrame is not set to a particular column or some other value using the index_col argument to read_csv, it defaults to a sequence of integers beginning at 0, which is fine for the Tips dataset. The index is a range of integers from 0 up to but not including 244, so it runs from 0 for the first row to 243 for the last row or observation in the dataset.
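To illustrate the difference between the default index and one set with index_col, here is a small sketch using an in-memory csv built from three rows of the tips data. Using total_bill as the index is purely for demonstration and is not something this notebook does.

```python
import io
import pandas as pd

# Three rows of the tips data, reduced to three columns for brevity
csv_text = "total_bill,tip,size\n16.99,1.01,2\n10.34,1.66,3\n23.68,3.31,2\n"

# Without index_col, pandas creates a RangeIndex starting at 0
default_index = pd.read_csv(io.StringIO(csv_text))
print(default_index.index)

# With index_col, the named column becomes the index and is no longer a regular column
custom_index = pd.read_csv(io.StringIO(csv_text), index_col="total_bill")
print(custom_index.index)
print(custom_index.columns.tolist())
```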

# dtypes:
The dtypes (data types) have been inferred by read_csv but it is also possible to pass the data type when reading in the file.
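As a sketch of passing data types at read time, the example below supplies a dtype mapping to read_csv. It uses a small in-memory csv built from two rows of the tips data, and the choice of 'category' and 'int32' is an illustrative assumption rather than what this notebook does.

```python
import io
import pandas as pd

# Two rows of the tips data, reduced to four columns for brevity
csv_text = "total_bill,tip,sex,size\n16.99,1.01,Female,2\n10.34,1.66,Male,3\n"

# Columns named in the dtype mapping use the given type; the others are still inferred
typed = pd.read_csv(io.StringIO(csv_text), dtype={"sex": "category", "size": "int32"})
print(typed.dtypes)
```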

In [7]:
print('The dtypes in the dataset are:', end='\n\n')
print(df.dtypes)

The dtypes in the dataset are:

total_bill    float64
tip           float64
sex            object
smoker         object
day            object
time           object
size            int64
dtype: object


There are three numerical columns and four non-numerical object columns. The variables total_bill and tip are floats representing US dollar amounts, while size is an integer representing the number of people in the party. The remaining columns have been read in as objects. Pandas uses the object dtype for storing strings and other arbitrary objects, and for columns containing mixed types.

smoker is a binary categorical variable with two values, Yes and No. sex is also a binary categorical variable with two values, Male and Female. The day and time variables can also be treated as categorical variables as they have a limited number of distinct possible values. The time column is not an actual time but a binary categorical variable with two possible values, Dinner and Lunch, while day has four possible values: Thur, Fri, Sat and Sun for Thursday, Friday, Saturday and Sunday.

When a string variable takes only a few distinct values, converting it to a categorical variable will save some memory. Specifying dtype='category' results in an unordered Categorical whose categories are the unique values observed in the data. The astype method can also be used to convert the dtype of a column after the DataFrame has been created.
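The memory saving can be checked with memory_usage(deep=True). The sketch below uses a made-up Series of repeated day labels rather than the tips data itself, so the exact byte counts it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical column: four distinct day labels repeated 250 times each
days = pd.Series(["Thur", "Fri", "Sat", "Sun"] * 250)

as_object = days.astype("object")      # every row stores its own Python string
as_category = days.astype("category")  # every row stores a small integer code

# deep=True counts the memory used by the strings themselves, not just the pointers
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print("object:", object_bytes, "bytes; category:", category_bytes, "bytes")

# The categories are the unique observed values and are unordered by default
print(as_category.cat.categories.tolist(), as_category.cat.ordered)
```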

# Converting variables to type category:

In [8]:
df['sex']=df['sex'].astype('category')
df['smoker']=df['smoker'].astype('category')
df['day']=df['day'].astype('category')
df['time']=df['time'].astype('category')
print(*df.dtypes)

float64 float64 category category category category int64


# Checking for missing or N/A values
Next, check whether there are any missing values or NAs in the dataset. The isna() function returns a boolean for every value, and any() then reports for each column whether at least one value is missing. Every column returns False here, so there are no missing or NA values in the dataset.

In [9]:
print(*df.isna().any())

False False False False False False False
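When a count of missing values is wanted rather than a yes or no per column, the booleans returned by isna() can be summed. This is a minimal sketch on a made-up DataFrame with two deliberately missing values, since the tips data has none.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing total_bill and one missing tip
sample = pd.DataFrame({
    "total_bill": [16.99, np.nan, 21.01],
    "tip": [1.01, 1.66, np.nan],
    "size": [2, 3, 3],
})

missing_flags = sample.isna().any()      # True for columns with at least one NA
missing_counts = sample.isna().sum()     # number of NAs per column (True counts as 1)
total_missing = int(missing_counts.sum())

print(missing_flags)
print(missing_counts)
print("total missing values:", total_missing)
```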


Pandas has many useful functions for slicing and dicing the data. The data can be sorted and particular rows and columns can be selected in different ways.

# Sorting by values:
While the head and tail functions show the top and bottom rows of a dataset as read in from the data source, the values may not be sorted. The sort_values function can be used to sort the dataframe in ascending or descending order by one or more variables to get an idea of the range of values in the dataset.

In [11]:
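# the five smallest tips (not displayed, as a notebook cell only shows its last expression)
df.sort_values(by='tip').head()
# the three largest total bills
df.sort_values(by='total_bill', ascending=False).head(3)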

     total_bill    tip   sex smoker  day    time  size
170       50.81  10.00  Male    Yes  Sat  Dinner     3
212       48.33   9.00  Male     No  Sat  Dinner     4
59        48.27   6.73  Male     No  Sat  Dinner     4
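Sorting by more than one variable works by passing lists to by and ascending. The sketch below uses the first four rows of the tips data, reduced to three columns. nlargest is included as a possible shortcut for the sort-then-head pattern used above.

```python
import pandas as pd

# First four rows of the tips data, numeric columns only
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "size": [2, 3, 3, 2],
})

# Sort by party size (ascending), then by tip (descending) within each size
by_size_then_tip = sample.sort_values(by=["size", "tip"], ascending=[True, False])
print(by_size_then_tip)

# nlargest returns the rows with the largest values in a column without a full sort call
largest_tips = sample.nlargest(2, "tip")
print(largest_tips)
```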


# Describing the Tips Dataset using statistics.
Exploratory data analysis generally involves both non-graphical methods, which include the calculation of summary statistics, and graphical methods, which summarise the data in a picture or a plot. These methods can be univariate, where one variable is examined at a time, or multivariate, where two or more variables are examined together to explore relationships. Seltman[1] recommends performing univariate EDA on each of the components before going on to multivariate EDA. The actual EDA performed depends on the role and type of each variable. I will first look at the summary statistics of the categorical variables and then the numerical variables. For categorical variables, the range of values and the frequency or relative frequency of each value are of interest, that is, the count or fraction of the data that falls into each category.
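The frequency and relative frequency of a categorical variable can be obtained with value_counts. The sketch below uses a short made-up sample of day labels rather than the real column, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical sample of the 'day' column stored as a categorical variable
day = pd.Series(["Sun", "Sun", "Sat", "Sat", "Sat", "Thur"], dtype="category")

# Frequency: how many observations fall into each category
counts = day.value_counts()
print(counts)

# Relative frequency: the fraction of observations in each category
proportions = day.value_counts(normalize=True)
print(proportions)
```

On the tips DataFrame the same calls would be made on df['day'], df['sex'], df['smoker'] or df['time'].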