In [None]:
# Pandas is a package for data analysis and data manipulation. 
# Many other packages build on it, so understanding it and its concepts is a crucial skill 
# for a Python programmer. 
# In this chapter, we will introduce the pandas package from the basics up to some more advanced techniques. 
# However, before we get started with pandas, we will briefly cover numpy arrays, which, alongside dictionaries 
# and lists, need to be understood before we can cover pandas.

In [None]:
# Numpy Arrays

# Numpy comes as part of the Anaconda distribution and is a key component in the scientific libraries within Python. 
# It is very fast and underpins many other packages within Python.
# We concentrate on one specific aspect of it, numpy arrays. 
# However, if you are interested in any of the machine learning libraries within Python, then numpy is certainly something
# worth exploring further.
# We can import it as follows: 

import numpy as np

In [None]:
# Why np? It's the standard convention used in the documentation.
# You do not have to use that convention, but we will. 
# In this chapter, we won’t cover everything to do with numpy but instead only introduce a few concepts, 
# the first of which is the array. 
# An array in numpy is much like a list in Python. 
# If we want to create an array of the integers 1–10 we can do so as follows:

number_array = np.array([1,2,3,4,5,6,7,8,9,10])
number_array

In [None]:
# It looks like we passed a list into the array function, and that is exactly what we did, as we can define the same array from a named list.

number_list = [1,2,3,4,5,6,7,8,9,10]
number_array = np.array(number_list)
number_array

In [None]:
# You may be thinking: it looks like a list and we can use a list to create it, so why is it different from a list? 
# Very early on we looked at lists and operations on lists and we saw that using the common mathematical operators 
# either didn’t work or worked in an unexpected way.
# We will now cover these again and compare them to what happens when using an array in numpy. 
# We will begin by looking at addition:

number_list + number_list

In [None]:
number_array + number_array

In [None]:
# So, what we see is that with lists we get the concatenation of the two lists, which is what we have seen before.
# With arrays, however, the two arrays are added together and a single array is returned where the result is the 
# element-wise addition. 
# Next, let’s consider what happens when we use the subtraction and multiplication operators.

number_list - number_list
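
In [None]:
# For comparison, the cell below is a minimal sketch of the same subtraction applied to arrays. 
# It rebuilds number_array so that it runs on its own; the name difference_array is only illustrative.

```python
import numpy as np

number_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Unlike lists, arrays support the subtraction operator element-wise.
difference_array = number_array - number_array
print(difference_array)
```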

In [None]:
number_list * number_list

In [None]:
number_array * number_array

In [None]:
# Again, we see that this operator doesn’t work on two lists but the arrays provide element-wise multiplication. 
# Now for completeness we will look at the division operator on both lists and arrays.

number_list / number_list

In [None]:
number_array / number_array

In [None]:
# Unsurprisingly, we see that this doesn’t work on lists but on the arrays it performs element-wise division of the values 
# in the first array by those in the second. 
# This is useful when we want to perform element-wise arithmetic, which lists do not support directly, and it can be much faster. 
# For the examples we have covered so far we can achieve the same thing using lists in Python in one line via 
# list comprehension. 
# So the three examples that didn’t work here can be rewritten as follows:

number_list

In [None]:
subtraction_list = [n - n for n in number_list]
subtraction_list

In [None]:
multiplication_list = [n * n for n in number_list]
multiplication_list

In [None]:
division_list = [n / n for n in number_list]
division_list

In [None]:
# Now, it should be noted that we have simply worked with operations on the same list which makes it easy to rewrite.
# With two distinct lists the comprehension has to pair up the elements first (for example with zip), and rewriting 
# with explicit loops becomes more cumbersome. 
# Let’s demonstrate this by creating two random integer arrays.
# In numpy, we can do this by using the random choice functionality as follows:

np.random.choice(10,10)

# Here, we generated an array of length 10 containing random numbers between 0 and 9.
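
In [None]:
# As a small sketch of the two-list case, the cell below multiplies two distinct lists element by element, 
# once with a list comprehension and once with numpy arrays. 
# The lists first_list and second_list are made-up example data, not part of the chapter's data.

```python
import numpy as np

# Hypothetical example data: two distinct lists of equal length.
first_list = [1, 2, 3, 4]
second_list = [10, 20, 30, 40]

# zip pairs up the elements so that a comprehension can combine them.
product_list = [a * b for a, b in zip(first_list, second_list)]

# With arrays no explicit pairing is needed.
product_array = np.array(first_list) * np.array(second_list)

print(product_list)
print(product_array)
```

# Both approaches give the same values; the array version simply leaves the pairing to numpy.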

In [None]:
# Now if we extend this example to 1 million random numbers and generate two arrays we can multiply them together as follows:

x = np.random.choice(100,1000000)
x

In [None]:
y = np.random.choice(100,1000000)
y

In [None]:
result = x*y
result

In [None]:
# What we have just done is complete 1 million multiplications almost instantly.
# If you tried this using loops you would be waiting quite a bit longer than an instant! 
# We have seen how powerful numpy arrays are, but how do we access their elements? 
# Luckily we can access elements as we did for lists. 
# We will give some examples below applied to the result array from the previous example:

result[10]

In [None]:
result[10:20]

In [None]:
result[0]

In [None]:
result[-3:-1]

In [None]:
# You can see that we access elements in much the same way as we did for lists.
# Having introduced the concept of an array, we can now start looking at pandas, beginning with Series.

# Series

# We can import pandas as follows:

import pandas as pd

In [None]:
# As before with numpy, we use an alias, pd, which is the general convention used in the documentation for the package.
# The first thing that we will cover here is the concept of a Series, which we shall demonstrate first by an example:

point_dict = {"Bulgaria":45, "Romania":43, "Hungary":30, "Denmark":42}
point_dict

In [None]:
point_series = pd.Series(point_dict)
point_series

In [None]:
# We created a dictionary whose keys are country names and whose values are the median age of citizens 
# (source worldomometers.info) in that country, and passed it into the Series constructor to create point_series. 
# We can access the elements of the series as follows:

point_series[0]

In [None]:
point_series[1:3]

In [None]:
point_series[-1]

In [None]:
point_series[:-1]

In [None]:
point_series[[1,3]]

# point_series[1,3] will raise an error.
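
In [None]:
# Recent pandas versions may emit a deprecation warning when plain square brackets are used with positions 
# on a series whose index holds labels, so it is worth knowing the explicit forms. 
# The sketch below rebuilds point_series and uses iloc for positions and loc for labels; 
# the variable names on the left-hand side are only illustrative.

```python
import pandas as pd

point_series = pd.Series({"Bulgaria": 45, "Romania": 43, "Hungary": 30, "Denmark": 42})

# iloc always works by position, whatever the index contains.
first_value = point_series.iloc[0]
last_value = point_series.iloc[-1]
picked = point_series.iloc[[1, 3]]

# loc always works by index label.
by_label = point_series.loc["Hungary"]

print(first_value, last_value, by_label)
print(picked)
```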

In [None]:
# You can see we can access the first element as if it was a list using the position of the value we want. 
# We can also use colon-separated positional values as well as negative indices, which we have covered earlier. 
# There is another way to access elements of the series: passing a list of the positions we 
# want from the series. 
# So if we want the second and fourth elements we put the values 1 and 3 in the list.
# We have just accessed the values of the dict that we passed in to create the series, but what about the keys and 
# what use do they have in the series? 
# What we will now show is that the series can also be accessed as if it were a dictionary:

point_series.index

In [None]:
point_series["Bulgaria"]

In [None]:
# We see the series has an index made up of the keys of the dictionary, and we can access the values using the 
# dictionary access approach we have seen earlier. 
# When we covered dictionaries earlier we saw that trying to access a value with a key that isn’t 
# in the dictionary raises an exception (a KeyError), and the same is true for the series.

point_series["England"]

In [None]:
# Above, we have shown what happens when the index value is missing. Below, we use the 'get' method to try to access 
# the value for the index England. 
# As opposed to raising an exception it just returns None.

point_series.get("England")
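
In [None]:
# The 'get' method also accepts a default value to return when the index label is missing, much like 
# the dictionary method of the same name. 
# The sketch below rebuilds point_series; the default of 0 is an arbitrary choice for illustration.

```python
import pandas as pd

point_series = pd.Series({"Bulgaria": 45, "Romania": 43, "Hungary": 30, "Denmark": 42})

# Without a default, a missing label gives None.
missing = point_series.get("England")

# With a default, the default is returned instead.
with_default = point_series.get("England", 0)

# The in operator checks the index labels, not the values.
has_england = "England" in point_series

print(missing, with_default, has_england)
```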

In [None]:
# Given we can now access elements of a series we will now show how you can operate on it. 
# Given the series is based on the concept of an array in numpy you can do much of what you would in numpy to the series. 
# So, now we will create series of random numbers and show how we can operate on them.

np.random.rand(10)

In [None]:
# Here, we have used numpy’s random methods to generate an array of 10 random numbers between 0 and 1. 
# This can be assigned to a series relatively easily.

random_series = pd.Series(np.random.rand(10))
random_series

In [None]:
# We can operate on this in much the same way as we do with a numpy array.

random_series_one = pd.Series(np.random.rand(10))
random_series_one

In [None]:
random_series_two = pd.Series(np.random.rand(10))
random_series_two

In [None]:
random_series_one + random_series_two

In [None]:
random_series_one / random_series_two

In [None]:
# That all looks the same as we have seen for arrays earlier.
# However, one key difference is that operations on slices of a series are aligned by index.

random_series_one[3:] * random_series_two[:-1]

# What we see is that the multiplication is done on the elements of the series by index.
# Where an index value is not present in both series, the result shows NaN.
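
In [None]:
# To make the alignment easier to follow, here is a sketch with small fixed series instead of random numbers. 
# The series left_series and right_series are made-up data. 
# It also shows the 'mul' method with its fill_value argument, which substitutes a value for a label 
# that is missing from one of the two series.

```python
import pandas as pd

left_series = pd.Series([1, 2, 3, 4])
right_series = pd.Series([10, 20, 30, 40])

# The slices cover labels 1..3 and 0..2, so only labels 1 and 2 are shared.
aligned = left_series.iloc[1:] * right_series.iloc[:-1]

# fill_value=1 stands in for whichever side is missing a label.
filled = left_series.iloc[1:].mul(right_series.iloc[:-1], fill_value=1)

print(aligned)
print(filled)
```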

In [None]:
# We earlier defined a series by using a dictionary but we can define a series using a list or array as follows:

pd.Series([1,2,3,4,5,6])

In [None]:
pd.Series(np.array([1,2,3,4,5,6]))

In [None]:
# As you can see the index is defined automatically by pandas.
# However if we want a specific index we can define one as follows:

pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])

In [None]:
pd.Series([1,2,3,4,5], index=[5,4,3,2,1])

In [None]:
# So, here we pass an optional list to the index argument and this gets defined as the index for the series. 
# Note that the length of the index list must match that of the list or array that we want to turn into a series.
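
In [None]:
# A quick sketch of what happens when the lengths do not match: pandas raises a ValueError. 
# The values and labels below are arbitrary, and length_error is just an illustrative name.

```python
import pandas as pd

try:
    # Three values but only two index labels.
    pd.Series([1, 2, 3], index=["a", "b"])
    length_error = None
except ValueError as error:
    length_error = str(error)

print(length_error)
```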

In [None]:
# DataFrames

# Having looked at series, we will now turn our attention to data frames which are arguably the most popular aspect 
# of pandas and are certainly what we use the most. 
# They are essentially an object that carries data in column and row format, so for many they will mimic what is held 
# in a spreadsheet or for others the content of a database table.
# We will start off by looking at how we create a DataFrame and like with a series there are many ways we can do it.

countries = ["United Kingdom","France","Germany","Spain","Italy"]
median_age = [40,42,46,45,47]
country_dict = {"name":countries, "median age":median_age}
country_df = pd.DataFrame(country_dict)

In [None]:
country_dict

In [None]:
country_df

In [None]:
# What we have done above is begin by setting up two lists, one containing country names and another containing median ages. 
# These are then put into a dictionary with the keys name and median age.
# This dictionary is then passed into the 'DataFrame' constructor of pandas and what we get back is a DataFrame object 
# with the column names name and median age. 
# We can see here that the index is automatically defined as 0–4 to correspond with the number of elements in each list.

countries = pd.Series(["United Kingdom","France","Germany","Spain","Italy"])
median_age = pd.Series([40,42,46,45,47])

country_dict = {"name":countries, "median age":median_age}
country_df = pd.DataFrame(country_dict)
country_df

# We can do the same using a dictionary of Series again assigning the Series to a dictionary and passing it into 
# the DataFrame method. 
# The same would happen if we used numpy arrays.

In [None]:
# Next, we create a DataFrame using a list of tuples where the data is now country name, median age, and population density of the country.

data = [("United Kingdom",40,281),("France",42,119),("Italy",46,206)]
data

In [None]:
data_df = pd.DataFrame(data)
data_df

In [None]:
# Here, we have created a list of tuples which we then pass into the 'DataFrame' constructor, and it returns a three-column 
# by three-row data frame. 
# Unlike before we not only have auto-assigned index values but also auto-assigned column names, which aren’t the 
# most useful.
# However, we will later show how to assign both. 
# The same applies here for a list of lists, a list of series, or a list of arrays. 
# It also works for a list of dictionaries; however, the behaviour is slightly different.

data = [{"country":"United Kingdom", "median age":40, "density":281}, 
        {"country":"France", "median age":42, "density":119}, 
        {"country":"Italy", "median age":46, "density":206}]
data

In [None]:
data_df = pd.DataFrame(data)
data_df
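
In [None]:
# For the list of tuples, the auto-assigned labels can be avoided at creation time, because the DataFrame 
# constructor accepts optional columns and index arguments. 
# This is a sketch; the index labels a, b, c and the name named_df are illustrative choices.

```python
import pandas as pd

data = [("United Kingdom", 40, 281), ("France", 42, 119), ("Italy", 46, 206)]

# columns names each position in the tuples; index names each row.
named_df = pd.DataFrame(
    data,
    columns=["country", "median age", "density"],
    index=["a", "b", "c"],
)
print(named_df)
```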

In [None]:
# When the list of dictionaries is passed in we get the same DataFrame.
# However now we have column names from the dictionary. 
# On the face of it everything seems like it works the same as for lists of lists. 
# However if we change some of the keys we get some different behaviour.

data = [{"country":"United Kingdom", "median age":40, "density":281},
       {"country":"France", "median age":42, "density":119},
       {"country":"Italy", "median":46, "density":206}]
data

In [None]:
data_df = pd.DataFrame(data)
data_df

In [None]:
# What we see here is that, because not every dictionary has the same keys, pandas fills in the missing values with NaN. 
# Next, we will look at how to access elements of the dataframe.

data = [{"country":"United Kingdom", "median age":40, "density":281},
       {"country":"France", "median age":42, "density":119},
       {"country":"Italy", "median age":46, "density":206}]
data_df = pd.DataFrame(data)
data_df

In [None]:
data_df["country"]

In [None]:
data_df["country"][0]

In [None]:
data_df["country"][0:2]

In [None]:
# The first thing is that we have defined the data frame based on the list of dictionaries as we showed previously. 
# We then accessed all the elements of the column country by passing the name of the column as the key of the data frame. 
# We next showed how we could access the first element of that by adding the index of the value we wanted. 
# This is an important distinction as we aren’t asking for the first element, we are instead asking for the value of the 
# column with index value 0. 
# Lastly, we select the rows of the country with index 0 and 1 in the usual way we would for a list, but again we are asking
# for specific index rows.

In [None]:
data_df["country"][-1]

In [None]:
# Here, we asked for the last value of the column country as we would with a list.
# However it threw an error because there is no index −1 in the index for the data frame. 
# So we can’t treat the data frame as we would a list and we need to have an understanding of the index.
# For any dataframe we can find out the index and columns as follows:

data_df.index

In [None]:
data_df.columns

In [None]:
# Here, it shows the index as a range starting at 0 and stopping before 3, together with the step used each time. 
# It also shows the columns as a list of each name. 
# We can change the index of a data frame as follows:

data_df.index = ['a','b','c']
data_df

In [None]:
# Now if we want to access the first element of the country column we do so as follows:

data_df["country"]['a']

In [None]:
# Similarly if we want to change the column names of a data frame we do so as follows:

data_df.columns = ["country_name", "median_age", "density"]
data_df

In [None]:
# Given we have changed the index to strings, the question is how do we access the nth row if we don’t know what the index is? 
# Luckily, data frames have an indexer called iloc which allows us to access the nth row by just passing in the position of 
# the row that we want. It works as follows:

data_df.iloc[0]

In [None]:
data_df.iloc[0:1]

In [None]:
data_df.iloc[0:2]

In [None]:
data_df.iloc[-1]
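
In [None]:
# Both iloc and its label-based counterpart loc also accept a row and a column together, which avoids chaining 
# two sets of square brackets. 
# The sketch below rebuilds the data frame with the string index and renamed columns used above; 
# the names by_label, by_position, and last_density are illustrative.

```python
import pandas as pd

data_df = pd.DataFrame(
    {
        "country_name": ["United Kingdom", "France", "Italy"],
        "median_age": [40, 42, 46],
        "density": [281, 119, 206],
    },
    index=["a", "b", "c"],
)

# loc takes a row label and a column label.
by_label = data_df.loc["a", "country_name"]

# iloc takes a row position and a column position.
by_position = data_df.iloc[0, 0]
last_density = data_df.iloc[-1, 2]

print(by_label, by_position, last_density)
```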

In [None]:
# We can see we can access rows from the data frame by position, as if it was a list, which is cool.
# Let’s say we want to add a column of all ones to our DataFrame; we can do so as follows:

data_df["ones"] = 1
data_df

In [None]:
# We can then delete a column in a couple of ways:

del data_df["ones"]

In [None]:
data_df

In [None]:
data_df["ones"] = 1
data_df.pop("ones")

In [None]:
data_df

In [None]:
# Here, we first used the 'del' statement to delete the ones column, we then added it again and then used the 'pop' method 
# to remove the column. 
# Note that 'del' simply deletes the column from the DataFrame, whereas the 'pop' method returns the column we
# have popped as well as removing it from the DataFrame.
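
In [None]:
# A third option, sketched below, is the 'drop' method. Unlike 'del' and 'pop', it leaves the original 
# DataFrame untouched by default and returns a new one without the column. 
# The data frame is rebuilt here so the cell runs on its own; without_ones is an illustrative name.

```python
import pandas as pd

data_df = pd.DataFrame(
    {
        "country_name": ["United Kingdom", "France", "Italy"],
        "median_age": [40, 42, 46],
        "density": [281, 119, 206],
    }
)
data_df["ones"] = 1

# drop returns a new DataFrame; data_df itself still has the column.
without_ones = data_df.drop(columns=["ones"])

print(list(data_df.columns))
print(list(without_ones.columns))
```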

In [None]:
data_df["ones"] = 1
data_df

In [None]:
data_df["new_ones"] = data_df["ones"][1:2]
data_df

In [None]:
del data_df["new_ones"]
del data_df["ones"]

# What we see here is that when we use a partial column to form a new one, pandas knows to fill in the gaps with the NaN value. 
# There is another approach where we can insert a column and put it in a specific position:

In [None]:
data_df.insert(1,"twos",2)
data_df

# Here, we create a column containing the integer value 2 and put it into position 1 
# (remember position 0 is the first position) under the title twos. 
# This gives us full control of how we add to the dataframe.

In [None]:
del data_df["twos"]
data_df

In [None]:
# So, now we have a grasp of what a DataFrame is we can start doing some cool things to it. 
# Let’s say we want to take all rows where the density is less than 200.

data_df["density"] < 200

In [None]:
data_df[data_df["density"] < 200]

In [None]:
# What we have done here is test the values in the density column of data_df to see which ones are less than 200. 
# This concept is pretty key: we test every element in the column to see which ones are less than 200 and return 
# a boolean column to show which ones meet the criteria.
# We can then pass this into the square brackets around a DataFrame and it returns the rows where the condition is true. 
# We can also combine multiple boolean statements with &, where only rows that are true across all the statements are returned. This is shown below:

data_df[data_df["density"] < 250]

In [None]:
data_df["median_age"] > 42

In [None]:
data_df[(data_df["density"] < 250) & (data_df["median_age"] > 42)]

In [None]:
# It is important to note that the DataFrame isn’t changed in this instance it stays the same.
# To use the DataFrame that is returned from such an operation you need to assign it to a variable to use later.

(data_df["density"] < 250) & (data_df["median_age"] > 42)
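
In [None]:
# As a sketch of the point about assignment, the cell below stores the filtered rows in a new variable 
# and confirms that the original DataFrame is unchanged. 
# The data frame is rebuilt so the cell is self-contained; filtered_df is an illustrative name.

```python
import pandas as pd

data_df = pd.DataFrame(
    {
        "country_name": ["United Kingdom", "France", "Italy"],
        "median_age": [40, 42, 46],
        "density": [281, 119, 206],
    }
)

# Each condition needs its own parentheses when combined with &.
mask = (data_df["density"] < 250) & (data_df["median_age"] > 42)
filtered_df = data_df[mask]

print(filtered_df)
print(len(data_df))
```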

In [None]:
data_df["test"] = (data_df["density"] < 250) & (data_df["median_age"] > 42)

In [None]:
data_df

In [None]:
del data_df["test"]

# Here, we have used the same test as used in the previous example and assigned it to a column which is 
# now part of the DataFrame. 
# We could do the same thing if we wanted to create another column that uses the data in dataframe.

In [None]:
data_df["density"].sum()

In [None]:
data_df["density_proportion"] = data_df["density"] / data_df["density"].sum()

In [None]:
data_df

In [None]:
# Here, we have divided all values in the density column by the sum of all the values in the column, 
# which we can see is 606, to give us a new column of data. 
# We can also apply standard mathematical functions to a column. 
# Below we use the numpy exponential function to exponentiate every element of the column:

np.exp(data_df["density"])

In [None]:
# We can also loop across a dataframe as we have seen before with lists.

for n in data_df:
    print(n)

In [None]:
# This isn’t exactly what we thought we would get as it only loops across the column names.
# We really want to get into the meat of the dataframe, and to do that we have to introduce the concept of transpose.

data_df

In [None]:
data_df.T

In [None]:
# What we have done here is turn the DataFrame the other way so now the columns are the index. 
# To loop over it we use the 'items' method (called 'iteritems' before it was removed in pandas 2.0).

for n in data_df.T.items():
    print(n)

In [None]:
# What we see here is that when we use the 'items' method (formerly 'iteritems') on the transposed data frame, 
# each iteration of the loop returns a two-element tuple. 
# The first element is the index and the second the values in the row stored in a series. 
# A better way to access it is to assign each element to a variable, giving us better access to each part.

for ind, row in data_df.T.items():
    print(ind)
    print(row)

In [None]:
# The 'iteritems' method was removed in pandas 2.0; 'items' is its replacement.
for ind, row in data_df.T.items():
    print(ind)
    print(row["country_name"])

# We assign the first element of the tuple to the variable ind and the series of the row to the variable row. 
# Then we access the country_name column of that row and show it here with the index. 
# We can also avoid using the transpose of the dataframe by directly accessing the rows via the 'iterrows' method.

In [None]:
for ind, row in data_df.iterrows():
    print(ind)
    print(row)
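
In [None]:
# Another way to loop over rows, sketched below, is the 'itertuples' method, which yields one named tuple 
# per row instead of a series. The index is available as the Index attribute and each column under its own name. 
# The data frame is rebuilt so the cell is self-contained; pairs is an illustrative name.

```python
import pandas as pd

data_df = pd.DataFrame(
    {
        "country_name": ["United Kingdom", "France", "Italy"],
        "median_age": [40, 42, 46],
        "density": [281, 119, 206],
    },
    index=["a", "b", "c"],
)

pairs = []
for row in data_df.itertuples():
    # row.Index is the index label; columns are attributes of the tuple.
    pairs.append((row.Index, row.country_name))

print(pairs)
```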

In [None]:
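# We have looked at how to add columns to a data frame but now we will look at how to add rows. 
# The way we will consider is the 'append' method on dataframes. 
# Note that 'append' was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat([data_df, new_row_data_df]) 
# is the equivalent call. 
# We do so as follows: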

data = [{"country":"United Kingdom", "median_age":40, "density":281},
       {"country":"France", "median_age":42, "density":119},
       {"country":"Italy", "median_age":46, "density":206}]
data

In [None]:
data_df = pd.DataFrame(data)
data_df

In [None]:
new_row = [{"country":"Iceland", "median_age":37, "density":3}]

In [None]:
new_row_data_df = pd.DataFrame(new_row)

In [None]:
new_row_data_df

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat returns the same result.
pd.concat([data_df, new_row_data_df])

In [None]:
data_df

In [None]:
# So, we set up the initial data as we have done earlier but here make a fresh copy of the original data. 
# We then set up a DataFrame of the new row and combine it with the original dataframe. 
# What we then see is the DataFrame containing the new row.
# However, it has an index of zero which we already had in the original DataFrame. 
# We also see that when we call the DataFrame after this operation it no longer has the new row.
# We can resolve the index problem by using the argument ignore_index as follows 
# (pd.concat is used because DataFrame.append was removed in pandas 2.0):

pd.concat([data_df, new_row_data_df], ignore_index=True)

In [None]:
data_df

In [None]:
# So that is sorted, but what about the fact that the new row hasn’t become part of the data frame? 
# To get that to work we need to assign the result to a variable, as the operation returns a new DataFrame 
# rather than changing the original. 
# We could re-assign the result to the same name data_df; however, we would lose the original data, so
# we assign it to a new variable instead.

new_data_df = pd.concat([data_df, new_row_data_df])  # DataFrame.append was removed in pandas 2.0
new_data_df
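
In [None]:
# Putting the two points together, the sketch below adds the new row with pd.concat, renumbers the index 
# with ignore_index=True, and keeps the result in a new variable. 
# The data is the same as above; combined_df is an illustrative name.

```python
import pandas as pd

data_df = pd.DataFrame(
    [
        {"country": "United Kingdom", "median_age": 40, "density": 281},
        {"country": "France", "median_age": 42, "density": 119},
        {"country": "Italy", "median_age": 46, "density": 206},
    ]
)
new_row_data_df = pd.DataFrame([{"country": "Iceland", "median_age": 37, "density": 3}])

# The result is a new DataFrame with a fresh 0..3 index; data_df is untouched.
combined_df = pd.concat([data_df, new_row_data_df], ignore_index=True)

print(combined_df)
```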

In [None]:
# Merge, Join, and Concatenation

# Initially, we will consider the concept of concatenating DataFrames. 
# The manner in which we can do this is to create a list of DataFrames and pass them into the 'concat' method.
# These DataFrames will build on what we have looked at in the previous chapter by using the following country data:

df1 = pd.DataFrame({"density":[119,206,240,94],
                   "median_age":[42,47,46,45],
                   "population":[65,60,83,46],
                   "population_change":[0.22,-0.15,0.32,0.04]},
                  index=["France","Italy","Germany","Spain"])

df2 = pd.DataFrame({"density":[153, 464, 36, 25],
                   "median_age":[38, 28, 38, 33],
                   "population":[1439, 1380, 331, 212],
                   "population_change":[0.39, 0.99, 0.59, 0.72]},
                  index=["China","India","USA","Brazil"])

df3 = pd.DataFrame({"density":[9, 66, 347, 103],
                   "median_age":[40, 29, 48, 25],
                   "population":[145, 128, 126, 102],
                   "population_change":[0.04, 1.06, -0.30, 1.94]},
                  index=["Russia","Mexico","Japan","Egypt"])

frames = [df1,df2,df3]
result = pd.concat(frames)
result

In [None]:
result2 = pd.concat([df1,df2,df3])
result2

In [None]:
# What we did was to create a list of DataFrames and then pass them into the pd.concat function. 
# We get the result shown, which is a DataFrame with columns density, median_age, population, population_change 
# and rows indexed with country names. 
# But what if we did not have the country names as index values, as in the following example:

df1 = pd.DataFrame({"density":[119, 206, 240, 94],
                   "median_age":[42, 47, 46, 45],
                   "population":[65, 60, 83, 46],
                   "population_change":[0.22, -0.15, 0.32, 0.04],
                   "country_name":['France', 'Italy', 'Germany', 'Spain']})

df2 = pd.DataFrame({"density":[153, 464, 36, 25],
                   "median_age":[38, 28, 38, 33],
                   "population":[1439, 1380, 331, 212],
                   "population_change":[0.39, 0.99, 0.59, 0.72],
                   "country_name":['China', 'India', 'USA', 'Brazil']})

df3 = pd.DataFrame({"density":[9, 66, 347, 103],
                  "median_age":[40, 29, 48, 25],
                  "population":[145, 128, 126, 102],
                  "population_change":[0.04, 1.06, -0.30, 1.94],
                  "country_name":['Russia', 'Mexico', 'Japan', 'Egypt']})

In [None]:
frames = [df1,df2,df3]
result = pd.concat(frames)
result

In [None]:
# Here, we see that the index is retained for each DataFrame which when created all have the index 0, 1, 2, 3. 
# To have an index 0–11 we need to use the ignore_index argument and set it to True.

result = pd.concat(frames, ignore_index=True)
result

In [None]:
# We can expand on this example by creating a list of DataFrames as we did previously and concatenating them together, 
# but now we use the argument keys and set it to a list containing Region One, Region Two, and Region Three.
# Note that you cannot combine this with ignore_index=True, as the keys argument would then be ignored.

result = pd.concat(frames, keys=["Region One", "Region Two", "Region Three"])
result

In [None]:
result.loc["Region Two"]

In [None]:
# In running the code what we see is that passing the keys in gives us what appears to be 
# another level of index on top of the index in the previous example, which allows us 
# to select one of the DataFrames used in the concat. 
# If we look at the index of the result we get the following:

result.index

In [None]:
# This is commonly referred to as a multilevel index, as the name would suggest, and what it does is tell us 
# which index values each row has. 
# So the levels are ["Region One", "Region Two", "Region Three"] and [0, 1, 2, 3]. 
# The index for each row is then determined by the codes (called labels in older pandas versions), two lists of 
# twelve elements each: the first has values 0, 1, 2, which correspond to Region One, Region Two, and Region Three, 
# whilst the second has values 0, 1, 2, 3, which refer to the levels 0, 1, 2, 3. 
# We could name these levels by using the optional names argument.

result = pd.concat(frames, keys=["Region_One","Region_Two","Region_Three"], names=["Region","Item"])
result
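
In [None]:
# Once the levels are named, they can be used to select data. The sketch below uses two deliberately small 
# DataFrames (made-up subsets with a single density column) so the output is easy to check by eye. 
# It selects one value with a tuple of index values and one whole region with the 'xs' method.

```python
import pandas as pd

# Hypothetical cut-down frames; the chapter's frames have more rows and columns.
df_a = pd.DataFrame({"density": [119, 206]})
df_b = pd.DataFrame({"density": [153, 464]})

result = pd.concat([df_a, df_b], keys=["Region_One", "Region_Two"], names=["Region", "Item"])

level_names = list(result.index.names)

# A tuple addresses one row across both levels.
one_value = result.loc[("Region_Two", 1), "density"]

# xs takes a cross-section for one value of a named level.
region_two = result.xs("Region_Two", level="Region")

print(level_names)
print(one_value)
print(region_two)
```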

In [None]:
# In the previous example we used concat to concatenate the DataFrames together.
# However, there are other ways to use it which we will demonstrate now by concatenating urban 
# population percentage from France, Italy, Argentina, and Thailand to our initial DataFrame.

df1 = pd.DataFrame({"density":[119, 206, 240, 94],
                   "median_age":[42, 47, 46, 45],
                   "population":[65, 60, 83, 46],
                   "population_change":[0.22, -0.15, 0.32, 0.04]},
                  index=['France', 'Italy', 'Germany', 'Spain'])

df4 = pd.DataFrame({"urban_population":[82, 69, 93, 51]},
                  index=['France', 'Italy', 'Argentina', 'Thailand'])

In [None]:
pd.concat([df1,df4], axis=1, sort=False)

In [None]:
# Here, we have used concat with a list of DataFrames as we have done before but now we pass in the argument axis = 1. 
# Now, the axis argument says we concatenate on the columns, here 0 is index and 1 is columns. 
# So, we see commonality in the index with France and Italy, so we can add the extra column on and fill the values 
# that are not common with NaN. 
# Here, we have set the sort to be False which means we keep the order as if the two were joined one below the other. 
# If we set the value to be True we get the following:

In [None]:
pd.concat([df1,df4], axis=1, sort=True)

In [None]:
# We can see that with sort set to True we get the values sorted by index order. 
# Below we can also see what happens if we run the same query with the axis set to 0.

pd.concat([df1,df4], axis=0, sort=True)

# What we do is just concatenate the DataFrames one below each other with duplication 
# of the index for France and Italy.

In [None]:
# Concat also has an extra argument, join, that we will now explore by setting its value to inner.

pd.concat([df1,df4], axis=1, join="inner")

# As you can see we only have two rows returned which, if you look back at the example
# before, are the only two rows whose index values appear in both DataFrames. 
# The inner join is similar to that of a database join.
# However, here we don’t specify a key to use it on.
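
In [None]:
# To see the effect of the join argument side by side, the sketch below compares join="inner" with the 
# default join="outer" on cut-down versions of df1 and df4 (df1 keeps only its density column here 
# to keep the example short). The names inner and outer are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"density": [119, 206, 240, 94]},
                   index=["France", "Italy", "Germany", "Spain"])
df4 = pd.DataFrame({"urban_population": [82, 69, 93, 51]},
                   index=["France", "Italy", "Argentina", "Thailand"])

# inner keeps only index values found in both frames.
inner = pd.concat([df1, df4], axis=1, join="inner")

# outer (the default) keeps every index value and fills the gaps with NaN.
outer = pd.concat([df1, df4], axis=1, join="outer")

print(inner)
print(outer)
```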

In [None]:
# Next, we keep only the rows of df1 by reindexing the concatenated result with df1.index.

result = pd.concat([df1, df4], axis=1).reindex(df1.index)
result

# What we see is that we get back only the rows whose index values are in df1, and all
# the columns, because of the axis=1 argument. 
# Older pandas versions offered a join_axes argument for this, which defaulted to None.

# join_axes is deprecated and has since been removed. The supported way is now .reindex(df1.index), as used above.

In [None]:
# Next, we will ignore the index by using the following arguments:

pd.concat([df1,df4], ignore_index=True, sort=True)

# Here, we see the result has lost the index values from df1 and df4 and retained all the information, 
# filling the missing values with NaN.

In [None]:
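# In pandas versions before 2.0 we could achieve the same thing using the 'append' method directly on a DataFrame. 
# The method was removed in pandas 2.0, so the call is shown commented out.

# df1.append(df4, ignore_index=True, sort=True)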

In [None]:
# The concat method is not only valid for DataFrames but can also work on Series.

df1 = pd.DataFrame({"density":[119, 206, 240, 94],
                   "median_age":[42, 47, 46, 45],
                   "population":[65, 60, 83, 46],
                   "population_change":[0.22, -0.15, 0.32, 0.04]},
                  index=['France', 'Italy', 'Germany', 'Spain'])

s1 = pd.Series([82, 69, 93, 51], index=['France', 'Italy', 'Germany', 'Spain'], name="urban_population")

In [None]:
df1

In [None]:
s1

In [None]:
pd.concat([df1,s1], axis=0)

In [None]:
pd.concat([df1,s1], axis=1)

In [None]:
# What is worth noting is that we give the series a name and then that is set to be 
# the name of the column when the two are concatenated together. 
# We could also pass in multiple series in the list and we will add a second series with world share percentage.

s2 = pd.Series([0.84, 0.78, 1.07, 0.60], index=['France', 'Italy', 'Germany', 'Spain'], name="world_share")
s2

In [None]:
pd.concat([df1,s1,s2], axis=1)

In [None]:
# Next, we pass in series as a list to create a DataFrame and by specifying keys we can rename the columns.

s1 = pd.Series([82, 69, 93, 51], index=['France', 'Italy', 'Germany', 'Spain'], name='urban_population')
s2 = pd.Series([0.84, 0.78, 1.07, 0.60], index=['France', 'Italy', 'Germany', 'Spain'], name='world_share')
pd.concat([s1,s2])

In [None]:
pd.concat([s1,s2], axis=1)

In [None]:
pd.concat([s1,s2], axis=1, keys=['urban population','world share'])

In [None]:
# Next, we take our three DataFrames from before and assign them to a dictionary each with a key. 
# The dictionary is then passed into concat.

df1 = pd.DataFrame({"density":[119, 206, 240, 94],
                    "median_age":[42, 47, 46, 45],
                    "population":[65, 60, 83, 46],
                    "population_change":[0.22, -0.15, 0.32, 0.04]},
                  index=['France', 'Italy', 'Germany', 'Spain'])

df2 = pd.DataFrame({"density":[153, 464, 36, 25],
                    "median_age":[38, 28, 38, 33],
                    "population":[1439, 1380, 331, 212],
                    "population_change":[0.39, 0.99, 0.59, 0.72]},
                  index=['China', 'India', 'USA', 'Brazil'])

df3 = pd.DataFrame({"density":[9, 66, 347, 103],
                    "median_age":[40, 29, 48, 25],
                    "population":[145, 128, 126, 102],
                    "population_change":[0.04, 1.06, -0.30, 1.94]},
                  index=['Russia', 'Mexico', 'Japan', 'Egypt'])

pieces = {"region 1":df1, "region 2":df2, "region 3":df3}
pd.concat(pieces)

In [None]:
# In using a dictionary we automatically create a DataFrame with a multilevel index where
# the first level is the key of the dictionary and the second level the index of the DataFrame.
# We next do exactly the same but here pass in an optional keys list, which selects only the pieces with those keys.

pd.concat(pieces, keys=["region 2","region 3"])
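
In [None]:
# A minimal sketch of what the keys do when concat is given a dictionary. 
# The three two-row DataFrames a, b and c are hypothetical stand-ins for df1, df2 and df3 above, 
# kept small so the multilevel index is easy to read.

```python
import pandas as pd

# Hypothetical two-row frames standing in for df1, df2 and df3.
a = pd.DataFrame({"population": [65, 60]}, index=["France", "Italy"])
b = pd.DataFrame({"population": [1439, 1380]}, index=["China", "India"])
c = pd.DataFrame({"population": [145, 128]}, index=["Russia", "Mexico"])
pieces = {"region 1": a, "region 2": b, "region 3": c}

combined = pd.concat(pieces)                               # all three pieces
subset = pd.concat(pieces, keys=["region 2", "region 3"])  # only the listed keys

print(combined.index.nlevels)     # number of index levels
print(combined.loc["region 2"])   # select one piece by its dictionary key
print(subset.index.get_level_values(0).unique().tolist())
```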

In [None]:
# Having looked at the concat and append methods, we now consider how pandas deals with database-style merging.
# This is all done via the 'merge' method. 
# We will explain the specifics of each join type by example. 
# First, though, it is worth explaining the basics of database joins. 
# When we speak of database-style joins, we mean the mechanism for joining tables together via common values. 
# How the resulting table looks depends on the type of join: 
# for example, inner, outer, right, or left.

In [None]:
# We will develop the example of country data to combine DataFrames that contain data relating to common countries
# and now add in the data for the countries relating to the percentage world share.

left = pd.DataFrame({"density":[119, 206, 240, 94],
                    "median_age":[42, 47, 46, 45],
                    "population":[65, 60, 83, 46],
                    "population_change":[0.22, -0.15, 0.32, 0.04],
                    "country":['France', 'Italy', 'Germany', 'Spain']})
left

In [None]:
right = pd.DataFrame({"world_share":[0.84, 0.78, 1.07, 0.60],
                     "country":['France', 'Italy', 'Germany', 'Spain']})
right

In [None]:
pd.merge(left, right, on="country")

In [None]:
# In this example, we join two DataFrames on a common key which in this case is the country name. 
# The result is a DataFrame with only one country column where both left and right are merged.
# In the next example, we look at the merge method with a left and right DataFrame but this time 
# will have two keys to join on which will be passed in as a list to the on argument.
# This allows us to join on multiple values being the same.

left = pd.DataFrame({"density":[119, 206, 240, 94],
                    "median_age":[42, 47, 46, 45],
                    "population":[65, 60, 83, 46],
                    "population_change":[0.22, -0.15, 0.32, 0.04],
                    "country":['France', 'Italy', 'Germany', 'Spain']})
left

In [None]:
right = pd.DataFrame({"world_share":[0.84, 0.78, 1.07, 0.60],
                     "population":[65, 60, 85, 46],
                     "country":['France', 'Italy', 'Germany', 'Spain']})
right

In [None]:
pd.merge(left, right, on=["country","population"])

In [None]:
# Here, we have joined on country and population, and the resulting DataFrame contains only the rows where both DataFrames share 
# the same country and population. 
# So, we lose the Germany row from each DataFrame because the population values differ (83 on the left, 85 on the right).
# Next, we run the same code with the added argument how set to left.

pd.merge(left, right, on=["country","population"], how="left")

In [None]:
# The result of this is what is known as a left join. 
# We retain every row of the left DataFrame and add the values from the right DataFrame only 
# where the keys match. 
# Here the Germany row has no match on the right, so its world_share is NaN.
# Next, we consider a right join using the same example as before:

pd.merge(left, right, on=["country","population"], how='right')

In [None]:
# Essentially, this does the same as the left join. 
# However, it is now the right DataFrame whose rows are all retained, with the left one joined onto it, which is the reverse
# of what we saw with the left join.
# The next join to consider is the outer join and again for completeness, we use the previous
# example to show how it works.

pd.merge(left, right, how="outer", on=["country","population"])

In [None]:
# The outer join is a combination of both the left and right joins. 
# We have more rows than are in either DataFrame because the left and right joins each keep a row that the other drops, 
# and the outer join result needs all of them.
# The last how option we consider is the inner join.

pd.merge(left, right, how="inner", on=["country","population"])

In [None]:
# This join gives only the rows where the keys match in both the left and right DataFrames. 
# This is also the default when we do not pass the how argument:

pd.merge(left, right, on=["country","population"])
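
In [None]:
# To compare the four join types side by side, the sketch below counts the rows each one returns. 
# It reuses the country and population values from the examples above in trimmed-down DataFrames; 
# the row_counts dictionary is just an illustrative helper, not part of pandas.

```python
import pandas as pd

left = pd.DataFrame({"country": ["France", "Italy", "Germany", "Spain"],
                     "population": [65, 60, 83, 46],
                     "density": [119, 206, 240, 94]})
right = pd.DataFrame({"country": ["France", "Italy", "Germany", "Spain"],
                      "population": [65, 60, 85, 46],
                      "world_share": [0.84, 0.78, 1.07, 0.60]})

# Germany has population 83 on the left and 85 on the right, so that key does not match.
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on=["country", "population"], how=how)
    row_counts[how] = len(merged)

print(row_counts)
```

# The inner join drops the unmatched Germany rows, the left and right joins each keep one of them, 
# and the outer join keeps both.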

In [None]:
# Next, we join two DataFrames with columns population and country but we join only on country using an outer join.

left = pd.DataFrame({"population":[65, 60, 83, 46],
                    "country":['France', 'Italy', 'Germany', 'Spain']})

right = pd.DataFrame({"population":[65, 60, 85, 46],
                     "country":['France', 'Italy', 'Germany', 'Spain']})

pd.merge(left, right, how="outer", on="country")

In [None]:
# What we see here is that when both DataFrames have a column with the same name that is not used in the join, the names get changed. 
# Here, we now have population_x and population_y, which could be problematic if you are expecting to operate 
# on a column called population. 
# This makes sense as we need a way to distinguish the two and pandas takes care of it for us.
# Next, we do a merge using the indicator option set to True. 
# Here, we reuse the same two DataFrames, merge on the single column country, and do an outer join.

pd.merge(left, right, on="country", how="outer", indicator=True)

In [None]:
# The result has an extra _merge column showing, row by row, where each key was found: left_only, right_only, or both. 
# Here, every country appears in both DataFrames, so every row is marked both.
# The merge method is a pandas function that takes two DataFrames.
# However, we can also use a DataFrame's join method to join one onto another.

left = pd.DataFrame({"density":[119, 206, 240, 94],
                    "median_age":[42, 47, 46, 45],
                    "population":[65, 60, 83, 46],
                    "population_change":[0.22, -0.15, 0.32, 0.04]},
                   index=['France', 'Italy', 'Germany', 'Spain'])
left

In [None]:
right = pd.DataFrame({"World_share":[0.84, 0.78, 1.07, 0.60]},
                    index=['France', 'Italy', 'Germany','United Kingdom'])
right

In [None]:
left.join(right)

In [None]:
# What we see is that the left DataFrame is retained and we join the right one where the index values in right match those in left. 
# As with merge, we can choose how to join the DataFrames by passing the how argument we saw earlier. 
# Using the same example, we can show this.

left.join(right, how="outer")

In [None]:
# By using the outer join, we retain all the information from both DataFrames as we have seen when using 'merge' method. 
# And when there is no value in either one of the DataFrames, the cell will be filled in using NaN. 
# Using the same example with an inner join, following result will be shown:

left.join(right, how="inner")

In [None]:
# As expected, the inner join just retains where the DataFrames have common data 
# which here is for index France, Italy, and Germany.

# We can achieve the same result without using 'how' if we pass some different arguments to the 'merge' method. 
# These arguments are left_index and right_index, and by setting both to True 
# we get the same behaviour as the join method with how set to inner.

pd.merge(left, right, right_index=True, left_index=True)
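
In [None]:
# The claim that the two calls behave the same can be checked directly. 
# This is a small sketch using a one-column version of the left DataFrame above; 
# the names via_join and via_merge are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"population": [65, 60, 83, 46]},
                    index=["France", "Italy", "Germany", "Spain"])
right = pd.DataFrame({"world_share": [0.84, 0.78, 1.07, 0.60]},
                     index=["France", "Italy", "Germany", "United Kingdom"])

via_join = left.join(right, how="inner")
via_merge = pd.merge(left, right, left_index=True, right_index=True)

# equals compares shape, labels and values of the two results.
print(via_join.equals(via_merge))
print(sorted(via_join.index))
```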

In [None]:
# Next, we use the argument ‘on’ with the join method when applied to the left DataFrame.

left = pd.DataFrame({"density":[119, 206, 240, 94],
                    "median_age":[42, 47, 46, 45],
                    "population":[65, 60, 83, 46],
                    "population_change":[0.22, -0.15, 0.32, 0.04],
                    "country":['France', 'Italy', 'Germany', 'Spain']})

right = pd.DataFrame({"world_share":[0.84, 0.78, 1.07, 0.60]},
                    index= ['France', 'Italy', 'Germany','United Kingdom'])

left.join(right, on="country")

In [None]:
# By specifying the 'on' column, which is country, we match the index of right against this column, and the 
# resulting DataFrame keeps the index of the left DataFrame. 
# This is the kind of approach you may see with databases, where an id column in one table is matched against 
# the key of another table. 
# This example can be extended to multiple values in the on argument; however, to do this you would require multilevel 
# indexes, which will be covered later in the book. 
# We can remove any NaN values by adding the how argument and setting it to inner, as shown below.

left

In [None]:
right

In [None]:
left.join(right, on="country", how="inner")
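
In [None]:
# Before moving on, here is a sketch that revisits two merge options seen earlier in this section: 
# the renamed overlapping columns and the indicator column. 
# The suffixes argument replaces the default _x and _y endings, and the keys here are deliberately 
# mismatched (hypothetical trimmed data) so that the _merge column shows all three possible values.

```python
import pandas as pd

left = pd.DataFrame({"country": ["France", "Italy", "Germany"],
                     "population": [65, 60, 83]})
right = pd.DataFrame({"country": ["France", "Italy", "Spain"],
                      "population": [65, 60, 46]})

merged = pd.merge(left, right, on="country", how="outer",
                  suffixes=("_left", "_right"), indicator=True)
print(merged)

# Map each country to where its key was found.
origin = dict(zip(merged["country"], merged["_merge"].astype(str)))
print(origin)
```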

In [None]:
# The next thing we will consider is the important concept of missing data. 
# We all hope to work with perfect datasets but the reality is we generally won’t and having the ability to work 
# with missing or bad data is an important one. 
# Luckily pandas offers some great tools for dealing with this and we begin by showing how to identify where we 
# have NaN in our dataset.

left = pd.DataFrame({"density":[119, 206, 240, 94],
                    "median_age":[42, 47, 46, 45],
                    "population":[65, 60, 83, 46],
                    "population_change":[0.22, -0.15, 0.32, 0.04],
                    "country":['France', 'Italy', 'Germany', 'Spain']})

right = pd.DataFrame({"world_share":[0.84, 0.78, 1.07, 0.60],
                     "population":[65, 60, 85, 46],
                     "country":['France', 'Italy', 'Germany', 'Spain']})

result = pd.merge(left, right, how="outer", on=["country","population"])
result

In [None]:
pd.isna(result)

In [None]:
pd.isna(result["density"])

In [None]:
result["median_age"].notna()

In [None]:
result.isna()

In [None]:
# Here, we have taken the DataFrames we have seen before and created a result DataFrame using the 'merge' method 
# with how set to outer. 
# This gives us a DataFrame with NaN values, so we can now demonstrate how to find where 
# these values are within your DataFrame. 
# We first apply the pandas 'isna' function to the whole DataFrame and to a single column; it tests each element and 
# returns True where the value is NaN. 
# The 'notna' method does the opposite, returning True where a value is present, and both are also available as 
# methods on a column or on the whole DataFrame. 
# This makes it very easy to determine what is and what isn’t NaN in our DataFrame.

result

In [None]:
result["density"].dropna()

In [None]:
result["density"].notna()

In [None]:
result[result["density"].notna()]

In [None]:
# Taking the example one step further, we can drop values from a column or the whole DataFrame by using the 'dropna' method. 
# For the column we only drop the one value that is NaN.
# However, across the whole DataFrame, dropna removes any row that has NaN in it. 
# This may not be ideal; instead we may want to remove only the rows where one particular column has NaN, and we can do that 
# by passing that column's notna result to the whole DataFrame as a filter.
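
In [None]:
# A minimal sketch of the approaches described above, on a small hypothetical DataFrame with one NaN 
# in each of two columns. The subset argument of dropna is shown as an alternative to the notna filter; 
# it is not used in the original example.

```python
import numpy as np
import pandas as pd

result = pd.DataFrame({"country": ["France", "Germany", "Germany"],
                       "density": [119.0, 240.0, np.nan],
                       "world_share": [0.84, np.nan, 1.07]})

col_only = result["density"].dropna()          # drops the NaN from one column
any_nan = result.dropna()                      # drops rows with NaN in any column
by_mask = result[result["density"].notna()]    # keeps rows where density is present
by_subset = result.dropna(subset=["density"])  # same rows as by_mask

print(len(col_only), len(any_nan), len(by_mask))
```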

In [None]:
# DataFrame Methods

# Now, in the next example, we will show some of the methods we can apply to a DataFrame.
# Earlier we demonstrated the sum method, however pandas has lots more to offer and we will look 
# at some of the more common mathematical ones. 
# Here, we import the package seaborn and load the iris dataset that comes with it giving us the data in a DataFrame.

import seaborn as sns

iris = sns.load_dataset('iris')

In [None]:
iris.head(10)

In [None]:
iris["sepal_length"].head()

In [None]:
iris.tail()

In [None]:
iris.columns

In [None]:
# After importing the package and loading the iris data, we can access the top of the DataFrame by using head, which 
# by default gives us the top five rows, and we can use the tail method to get the bottom five rows. 
# We can get a defined number of rows by passing the number into the head or tail method, and if we want just 
# the column names back we can use the columns attribute.
# Having imported and accessed the data we now demonstrate some methods which we can apply.

iris.count()

In [None]:
iris.count().sepal_length

In [None]:
iris["sepal_length"].count()

In [None]:
len(iris)

In [None]:
# We can apply count to both the DataFrame and the column. 
# When applied to the DataFrame it returns the number of non-missing values in each column. 
# We can also get the count for a specific column by either using the column name on the end of the count method or by accessing
# the column and then applying the count method. 
# If you want the number of rows in the DataFrame as a whole, you can use the built-in len function on the DataFrame.

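# numeric_only=True skips the text column species (pandas 2.0 and later raise an error without it)
iris.corr(numeric_only=True)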

In [None]:
iris.corr(numeric_only=True)["petal_length"]

In [None]:
iris.corr(numeric_only=True)["petal_length"]["sepal_length"]

In [None]:
iris.cov(numeric_only=True)

In [None]:
iris.cov(numeric_only=True)["sepal_length"]

In [None]:
iris.cov(numeric_only=True)["sepal_length"]["sepal_width"]

In [None]:
# The corr method applied to the DataFrame gives us the correlation between each pair of numeric variables, 
# and we can limit that to one column's correlation with all others by passing the column 
# name, or get the correlation between two columns by passing both column names. 
# The same applies to the cov method, which calculates the covariance between variables.
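
In [None]:
# The values returned by corr and cov are easier to trust on data where the answer is known in advance. 
# The sketch below uses a hypothetical four-row DataFrame in which y is exactly twice x and z falls as x rises. 
# It assumes pandas 1.5 or later, where corr and cov accept the numeric_only argument.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.0, 4.0, 6.0, 8.0],   # y = 2x, perfectly correlated with x
                   "z": [4.0, 3.0, 2.0, 1.0],   # z decreases as x increases
                   "label": ["a", "b", "c", "d"]})

corr = df.corr(numeric_only=True)   # the text column label is left out
cov = df.cov(numeric_only=True)

print(corr["x"])        # x against every numeric column
print(corr["x"]["z"])   # a single pair
print(cov["x"]["y"])    # covariance of x and y
```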

In [None]:
# Next, we consider the cumsum method. This provides the cumulative sum down each column. 
# For columns of numeric type, each value is the current value added 
# to the running total of the values before it, giving an accumulating value. 
# The difference comes when we consider a character-based column. 
# The cumulative value here is just the concatenation of the values, so the results look very strange. 
# To make things easier to read, we can restrict what we show by specifying a list
# of the columns to return, and as you can see we can even chain the tail method on the end.

iris.cumsum()

In [None]:
iris.cumsum().tail()

In [None]:
iris.columns

In [None]:
iris.cumsum()[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].tail()

In [None]:
# We use the 'describe' method, which gives us a number of values, namely the count, mean, standard deviation, 
# minimum, maximum, and the 25th, 50th, and 75th percentiles.
# By default this method only covers numeric columns, so we note that the column species is not included. 
# We can also use this on individual columns. In the example we do not use the 
# square bracket approach to access a column but instead the dot approach, where we use a dot and the column name 
# to access the column and then chain the describe method on the end.

iris.describe()

In [None]:
iris.sepal_length.describe()

In [None]:
# Next, we consider the max value. 
# Here, when applied to the entire DataFrame we get the max of every column where a maximum value can be obtained. 
# We also show that we can apply the method on a column in the same manner as we showed in the previous example.

iris.max()

In [None]:
iris.sepal_length.max()

In [None]:
iris.sepal_width.mean()

In [None]:
# columns: axis=0 gives one mean per column (numeric_only=True skips species)

iris.mean(axis=0, numeric_only=True)

In [None]:
(iris.iloc[0,0] + iris.iloc[0,1] + iris.iloc[0,2] + iris.iloc[0,3]) / 4 

In [None]:
(iris.iloc[1,0] + iris.iloc[1,1] + iris.iloc[1,2] + iris.iloc[1,3]) / 4 

In [None]:
# rows: axis=1 gives one mean per row (numeric_only=True skips species)

iris.mean(axis=1, numeric_only=True)

In [None]:
iris.mean(axis=1, numeric_only=True).head()

In [None]:
iris.mean(axis=1, numeric_only=True).tail()

In [None]:
# The next method we look at is the mean, which is a common calculation that you may want to make, 
# and as before we can apply it to an individual column, which we have done here using the dot syntax. 
# We then apply the mean method to the DataFrame, passing axis=0 to get one mean per column or axis=1 to get one mean per row. 
# There are a number of different methods that you can apply to a DataFrame and a list of some of the more useful ones is 
# given below:

# ● median: returns the median (the middle value)
# ● min: returns the minimum value
# ● max: returns the maximum value
# ● mode: returns the most frequent value or values
# ● std: returns the standard deviation
# ● sum: returns the arithmetic sum
# ● var: returns the variance

# These are demonstrated as follows:

import seaborn as sns
iris = sns.load_dataset("iris")

In [None]:
iris.sepal_length.median()

In [None]:
iris.sepal_length.min()

In [None]:
iris.sepal_length.mode()

In [None]:
iris.sepal_length.max()

In [None]:
iris.sepal_length.std()

In [None]:
iris.sepal_length.sum()

In [None]:
iris.sepal_length.var()
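
In [None]:
# The axis argument used with mean earlier is easy to mix up, so here is a small sketch on a 
# hypothetical two-row DataFrame where both sets of means can be checked by hand.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 6.0], "c": [3.0, 9.0]})

col_means = df.mean(axis=0)   # collapses the rows: one value per column
row_means = df.mean(axis=1)   # collapses the columns: one value per row

print(col_means)
print(row_means)
```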

In [None]:
# Missing Data

# We next consider methods we can apply across the DataFrame and how missing data is dealt with. 
# Here, we set the DataFrame up in the way we have done so far in the section and introduce some 
# NaN entries into the DataFrame.

data = pd.DataFrame({"A":[1, 2.1, np.nan, 4.7, 5.6, 6.8],
                    "B":[.25, np.nan, np.nan, 4, 12.2, 14.4]})
data

In [None]:
data.dropna(axis=0)

In [None]:
data.dropna(axis=1)

In [None]:
data

In [None]:
data.where(pd.notna(data), data.mean(), axis="columns")

In [None]:
data.fillna(data.mean()["B":"C"])

In [None]:
data.fillna(data.mean())

In [None]:
# forward fill (fillna(method="pad") is deprecated in recent pandas versions)
data.ffill()

In [None]:
# backward fill (fillna(method="bfill") is deprecated in recent pandas versions)
data.bfill()
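
In [None]:
# The filling approaches above differ in where the replacement value comes from. 
# This sketch uses a hypothetical three-row DataFrame so each result can be read off directly: 
# filling with the column mean, carrying the last valid value forward, and pulling the next valid value backward.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": [1.0, np.nan, 3.0],
                     "B": [np.nan, 4.0, np.nan]})

with_mean = data.fillna(data.mean())  # each column filled with its own mean
forward = data.ffill()                # last valid value carried forward
backward = data.bfill()               # next valid value pulled backward

print(with_mean)
print(forward)
print(backward)
```

# Note that a forward fill cannot fill a NaN at the start of a column and a backward fill cannot 
# fill one at the end, because there is no neighbouring value to copy from.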

In [None]:
data.interpolate()

In [None]:
data.interpolate(method="barycentric")

In [None]:
data.interpolate(method="spline", order=2)

In [None]:
data.interpolate(method="polynomial", order=2)

In [None]:
# So interpolate has a number of methods that you can use to fill in the NaN values. 
# The default, which is used when no argument is passed, is linear: it ignores the index, treats 
# the values as equally spaced, and fills linearly between them. 
# The remaining methods are all taken from scipy.interpolate, with a brief description given below:

# ● barycentric: Constructs a polynomial that passes through a given set of points.
# ● pchip: PCHIP one-dimensional monotonic cubic interpolation
# ● akima: Fit piecewise cubic polynomials, given vectors x and y
# ● spline: Spline data interpolator where we can pass the order of the spline
# ● polynomial: Polynomial data interpolator where we can pass the order of the polynomial

# For more information please refer to the scipy documentation.
# Next, we will consider interpolate on a series and show some of the optional arguments that we can pass.

ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan])
ser

In [None]:
ser.interpolate()

In [None]:
ser.interpolate(limit=1)

In [None]:
ser.interpolate(limit=1, limit_direction="backward")

In [None]:
ser.interpolate(limit=1, limit_direction="both")

In [None]:
ser.interpolate(limit_direction="both")

In [None]:
ser.interpolate(limit=1, limit_direction="both", limit_area="inside")

In [None]:
ser.interpolate(limit_direction="backward", limit_area="outside")

In [None]:
ser.interpolate(limit_direction="both", limit_area="outside")

In [None]:
# Initially, we interpolate using the default method, which is linear, and for the rest of the example we keep that method 
# and vary the optional arguments. 
# Next, we pass the limit option and set it to 1, which says we can only fill one consecutive NaN past any valid value, 
# so we still have NaN data in the Series. 
# We next keep limit set to 1 and add another argument, limit_direction, set to backward. 
# This again fills only one value next to an existing value but, unlike before, works backwards. 
# We extend this in the next example by setting limit_direction to both, which interpolates both forwards and backwards 
# for one value.
# We next remove the limit and keep limit_direction as both, and see that all values are interpolated. 
# We next introduce the limit_area option, which has two options (aside from the default None): inside and outside. 
# When set to inside, NaN values are only filled when they are surrounded by valid values, and when set to outside it only fills 
# outside valid values.
# Here, we show examples using each of these alongside limit_direction and limit.
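
In [None]:
# The effect of these options is easier to see with the results lined up next to each other. 
# The sketch below collects a few of the variants into one DataFrame, using the same Series as above; 
# the column names in comparison are illustrative labels only.

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan])

comparison = pd.DataFrame({
    "original": ser,
    "default": ser.interpolate(),                    # linear, filling forwards
    "limit_1": ser.interpolate(limit=1),             # at most one NaN filled per gap
    "inside": ser.interpolate(limit_area="inside"),  # only NaN between valid values
})
print(comparison)
```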

In [None]:
# Next, we introduce the 'replace' method.

iris.sepal_length.unique()

In [None]:
iris.sepal_width.unique()

In [None]:
iris.petal_length.unique()

In [None]:
iris.petal_width.unique()

In [None]:
x = iris.replace(2.3, 2.35)

In [None]:
x.petal_width.unique()

In [None]:
iris.species.unique()

In [None]:
iris.replace(['setosa', 'versicolor', 'virginica'],
             ['set', 'ver', 'vir'])

In [None]:
iris.replace(['setosa', 'versicolor', 'virginica'],
             ['set', 'ver', 'vir']).head(10)

In [None]:
iris.replace(['setosa', 'versicolor', 'virginica'],['set', 'ver', 'vir'])["species"].unique()
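
In [None]:
# As a small sketch of what replace is doing in the cells above: it returns a new DataFrame and leaves 
# the original untouched. The three-row DataFrame is a hypothetical cut-down version of iris, and the 
# nested dictionary form, which restricts each replacement to one column, is an addition not used above.

```python
import pandas as pd

df = pd.DataFrame({"species": ["setosa", "versicolor", "virginica"],
                   "petal_width": [0.2, 2.3, 2.3]})

# Two lists of equal length: each old value maps to the new value in the same position.
by_lists = df.replace(["setosa", "versicolor", "virginica"], ["set", "ver", "vir"])

# Nested dictionary: {column: {old: new}} limits each replacement to that column.
by_dict = df.replace({"species": {"setosa": "set"}, "petal_width": {2.3: 2.35}})

print(by_lists)
print(by_dict)
```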

In [None]:
# Grouping

# Next, we introduce the concept of grouping the data via the groupby method. 
# Grouping data is a very powerful tool as we are able to create and operate on groups of data all at once.

In [None]:
iris.head()

In [None]:
groupby = iris.groupby("species")

In [None]:
groupby

In [None]:
groupby.mean()

In [None]:
groupby.median()

In [None]:
iris.groupby("species").max()

In [None]:
# Above we see the groupby applied to the iris dataset, where we group the data based on the column species. 
# This then allows us to apply methods to the groupby object, and we show the results of the mean, median, and max methods. 
# Each method is applied to every other column in the dataset, separately for each distinct value of species.
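
In [None]:
# One way to convince yourself of what groupby computes is to reproduce a single group by hand. 
# This sketch uses a hypothetical four-row DataFrame and compares the grouped mean for setosa with 
# the mean of the rows filtered manually.

```python
import pandas as pd

df = pd.DataFrame({"species": ["setosa", "setosa", "virginica", "virginica"],
                   "sepal_length": [5.0, 5.2, 6.4, 7.0]})

grouped_mean = df.groupby("species")["sepal_length"].mean()
manual_mean = df[df["species"] == "setosa"]["sepal_length"].mean()

print(grouped_mean)
print(manual_mean)
```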

In [None]:
# We next demonstrate how to loop over a group. 
# Here, we set the DataFrame up as seen previously but now we loop over the group and in looping over it print the name of the
# group and what is in that group. 
# This gives us a good visualisation of what a groupby does to the data.

groupby = iris.groupby("species")
for name, group in groupby:
    print(name)
    print(group.head())

In [None]:
# Next, we introduce the aggregate method applied to a groupby. 
# We set the data up in the same way as seen earlier and then apply the aggregate method of the groupby object and
# inside it pass what we want to use for this aggregation. 
# In the example, we show the np.mean method which will be applied to the group.

grouped = iris.groupby("species")
grouped.aggregate(np.mean)

In [None]:
# We can extend the previous example by introducing the as_index argument. 
# Here, we use the same DataFrame from the previous examples and groupby species with as_index set to False. 
# What this does is group on species but keep species in the output as its own column rather than as the index. 
# In this case, we apply the mean to the group, so all other columns are averaged within the group.

iris.groupby("species", as_index=False).mean()

In [None]:
iris.groupby("species", as_index=True).mean()

In [None]:
# There are also methods that we can apply to a groupby object which can be useful.

grouped = iris.groupby("species")
grouped.size()

In [None]:
grouped["sepal_length"].describe()

In [None]:
# We can also apply different methods to the group and in this example we show multiple ways to apply the numpy methods 
# sum, mean, and std to our grouped data. 
# So we create the same DataFrame and group as in the last examples. 
# What we can then do is use the agg method with the arguments being a list of methods to be applied and what we see is
# that each method is applied on the group of data. 
# Lastly, here we can even apply a lambda function to the groupby.

grouped = iris.groupby("species")
grouped["sepal_length"].aggregate([np.sum, np.mean, np.std])

In [None]:
# pass the lambda directly; it is applied to each column within each group
grouped.aggregate(lambda x: np.std(x, ddof=1))
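
In [None]:
# A short sketch of the same idea on a hypothetical four-row DataFrame. 
# It passes the aggregations by name as strings, which pandas also accepts, and then uses keyword 
# arguments to choose the output column names. The names total and spread are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"species": ["setosa", "setosa", "virginica", "virginica"],
                   "sepal_length": [5.0, 5.2, 6.4, 7.0]})
grouped = df.groupby("species")["sepal_length"]

# A list of function names gives one output column per function.
by_list = grouped.agg(["sum", "mean", "std"])

# Keyword arguments name the output columns; a lambda can be mixed in.
named = grouped.agg(total="sum", spread=lambda x: x.max() - x.min())

print(by_list)
print(named)
```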

In [None]:
# What we next show is that you can get the largest and smallest values with a group by using 
# the 'nlargest' and 'nsmallest' methods. 
# Here, the integer value you pass in gives you the number of values returned. 
# What you see is that we get the largest and smallest per group.

grouped = iris.groupby("species")
grouped["sepal_length"].nlargest(3)

In [None]:
grouped["petal_length"].nsmallest(4)

In [None]:
# Our next example introduces the apply method which can be very useful. 
# Here, we set the data up in the manner seen before and groupby column species. 
# We can then use the apply method on the group to apply whatever we pass through it to the groupby. 
# It should be noted we can also use the apply method on DataFrames and Series. 
# Here we see we have applied a custom function to the groupby.

grouped = iris.groupby("species")
def f(group):
    return pd.DataFrame({"original": group, "demeaned": group - group.mean()})

In [None]:
grouped["petal_length"].mean()

In [None]:
iris.head()

In [None]:
grouped["petal_length"].apply(f).head()
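
In [None]:
# The demeaning done by the custom function above can also be sketched with transform, which returns 
# a result aligned to the original rows. This is an alternative to the apply approach, not a replacement 
# for it, and the four-row DataFrame is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"species": ["setosa", "setosa", "virginica", "virginica"],
                   "petal_length": [1.4, 1.6, 5.0, 6.0]})

# transform("mean") repeats each group's mean on every row of that group.
group_mean = df.groupby("species")["petal_length"].transform("mean")
df["demeaned"] = df["petal_length"] - group_mean

print(df)
```

# After demeaning, the values within each group sum to zero (up to floating point rounding).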

In [None]:
# In the next example, we introduce a nice pandas method called qcut. 
# This cuts the data into buckets holding roughly equal numbers of values, based on the quantiles passed in. 
# Here, we apply qcut to the column sepal length of the iris dataset with the list of 
# quantiles 0, 0.25, 0.5, 0.75, and 1. 
# We assign the cut to the variable factor, and when it is passed into groupby the mean method gives the average 
# for each bucket, with the index showing the lower and upper bounds of each bucket.

factor = pd.qcut(iris["sepal_length"], [0, 0.25, 0.5, 0.75, 1])
factor.head()

In [None]:
# numeric_only=True skips the text column species
iris.groupby(factor).mean(numeric_only=True)
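
In [None]:
# To see the equal-sized buckets more plainly, here is a sketch of qcut on the hypothetical values 1 to 8, 
# where the quartile boundaries fall between the integers and each bucket receives exactly two values.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
buckets = pd.qcut(values, [0, 0.25, 0.5, 0.75, 1])

bucket_counts = buckets.value_counts().sort_index()
bucket_means = values.groupby(buckets, observed=True).mean()

print(bucket_counts)   # how many values landed in each bucket
print(bucket_means)    # the average of the values in each bucket
```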

In [None]:
# So far we have considered grouping on single columns. 
# However, we could also group on multiple columns. 
# The iris dataset isn’t best set up to allow us to do this, so we instead load the tips dataset. 
# The tips dataset contains the following columns:

# ● total_bill
# ● tip
# ● sex
# ● smoker
# ● day
# ● time
# ● size

# Given some of the columns only have limited responses, it makes it ideal to do a group by multiple columns. 
# So next we group by sex and smoker.

tips = sns.load_dataset("tips")
tips.head()

In [None]:
tips.tail()

In [None]:
grouped = tips.groupby(["sex","smoker"])
grouped

In [None]:
# numeric_only=True skips the categorical columns day and time
grouped.mean(numeric_only=True)

In [None]:
grouped.sum(numeric_only=True)

In [None]:
grouped = tips.groupby(["sex","smoker","time"])
grouped.mean(numeric_only=True)

In [None]:
# Here, we see that when we group by two or three variables we increase the number of values that are returned 
# by creating more combinations within the groups.
# A similar approach to groupby is the pivot table, which is common amongst spreadsheet users. 
# The concept is to take a combination of variables and group the data by it, which can seem similar to groupby. 
# The difference is that you can extend this to create more complicated groupings of your dataset. 
# We will demonstrate these by example.

tips.head()

In [None]:
pd.pivot_table(tips, index=["sex"])

In [None]:
pd.pivot_table(tips, index=["sex","smoker","day"])

In [None]:
pd.pivot_table(tips, index=["sex","smoker","day"], values=["tip"])

In [None]:
# In the above code, we use the tips as the dataset in each example and set a variety of index values starting 
# at just sex and extending to sex and smoker and then with the combination of sex, smoker and day. 
# In each example, when we pivot the data by default we end up with the average of the index across all of the 
# variables where we can take an average, so only numerical variables. 
# We don’t necessarily need to show all available variables as we have seen by passing the values argument as 
# a list of columns we want to include.

# By default, when we use the pivot_table command we get the average of the variables.
# However, we can control what we get back by passing the aggfunc argument, which takes a list of 
# functions we want to apply to the data. 
# Note here that we pass the numpy mean function as well as Python's built-in len function.

pd.pivot_table(tips, index=["sex","smoker","day"], values=["tip"], aggfunc=[np.mean,len])

In [None]:
# We can expand on this example by adding a columns argument and then the margins argument, which adds an 
# All row and column holding the aggregate across each row and column.

pd.pivot_table(tips, index=["sex","smoker"], values=["tip"], columns=["day"], aggfunc=[np.mean])

In [None]:
pd.pivot_table(tips, index=["sex","smoker"], columns=["day"], values=["tip"], aggfunc=[np.mean], margins=True)
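
In [None]:
# The link between pivot_table and groupby can be sketched on a hypothetical four-row version of tips. 
# With a single index column and the mean as the aggregate, the pivot table holds the same numbers as 
# the grouped mean; adding columns and margins then spreads the days across the top and appends All.

```python
import pandas as pd

df = pd.DataFrame({"sex": ["Male", "Male", "Female", "Female"],
                   "day": ["Sat", "Sun", "Sat", "Sun"],
                   "tip": [2.0, 4.0, 3.0, 5.0]})

pivot = pd.pivot_table(df, index="sex", values="tip", aggfunc="mean")
by_group = df.groupby("sex")["tip"].mean()

wide = pd.pivot_table(df, index="sex", columns="day", values="tip",
                      aggfunc="mean", margins=True)

print(pivot)
print(by_group)
print(wide)
```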

In [None]:
# Reading in Files with Pandas

# The examples in the chapter have used the datasets from Seaborn and while this is useful, pandas has a lot of methods 
# to allow you to read in external files. 
# If we relate this back to earlier in the book where we read and manipulated data within Python we can see that these 
# methods are a lot easier to use. 
# They also allow us to write back to file. 
# To show how this works we will take one of the existing datasets that we have been using and write to a csv and read 
# that back in:

tips.head()

In [None]:
import os
x = os.getcwd()
x

In [None]:
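# os.path.join builds the path with the right separator for the operating system
file_name = os.path.join("Files", "tips.csv")
file_path = os.path.join(x, file_name)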

In [None]:
tips.to_csv(file_path, index=False)

In [None]:
data = pd.read_csv(file_path)

In [None]:
data.head()

In [None]:
# What we have done is use the 'to_csv' method of the tips DataFrame to write the data into a file called ‘tips.csv’.
# Note this will live in the directory where the code is being run from, or in the one you set (as in the example above). 
# As well as the file name, we pass the index argument set to False, which prevents the DataFrame index being written 
# to the file with the other columns. 
# Now, to read this back in, we use the read_csv method from pandas and this takes the csv file and creates a DataFrame
# with the contents of this file. 
# These methods are extremely useful as we do not have to worry about the process of writing to file or reading from file. 
# Alongside 'read_csv' method, we have other read methods for different file types and here are some of the more useful ones. 
# For a complete list, consult the pandas documentation.

# ● read_excel: reads in xls, xlsx, xlsm, xlsb, odf, ods and odt file types
# ● read_json: reads a valid json string
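
In [None]:
# The effect of index=False is easiest to see by writing the same DataFrame both ways and reading it back. 
# This sketch writes to a temporary folder so it leaves no files behind; the file names and the small 
# DataFrame are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"size": [2, 3], "day": ["Sun", "Sat"]})

with tempfile.TemporaryDirectory() as folder:
    # Without the index: the file holds only the two data columns.
    file_path = os.path.join(folder, "tips_sample.csv")
    df.to_csv(file_path, index=False)
    restored = pd.read_csv(file_path)

    # With the index (the default): it comes back as an extra unnamed column.
    with_index_path = os.path.join(folder, "tips_with_index.csv")
    df.to_csv(with_index_path)
    with_index = pd.read_csv(with_index_path)

print(restored.columns.tolist())
print(with_index.columns.tolist())
```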

In [None]:
# If we take the examples in previous chapters, we created the following json and excel files called 
# boston.json and boston.xlsx. 
# We can read these into DataFrames using the following code:

file_name = os.path.join("Files", "boston.json")
file_path = os.path.join(x, file_name)

data = pd.read_json(file_path)
data

In [None]:
file_name = os.path.join("Files", "boston.xlsx")
file_path = os.path.join(x, file_name)

data = pd.read_excel(file_path)
data

In [None]:
# As you can see these methods provide very simple ways to load data from these common formats into DataFrames. 
# There is also a read_table method which we can use for general delimited files. 
# The 'read' methods also support operations like querying databases or even reading html; that is beyond the 
# scope of this book but well worth a look.
# The 'to' methods of DataFrames are similar, with support for many different formats, a selection of which is given as follows:

# ● to_dict
# ● to_json
# ● to_html
# ● to_latex
# ● to_string

# These are all demonstrated as follows:

tips = sns.load_dataset("tips")
tips.head().to_json()

In [None]:
# We can also use some of these methods to write the data directly to file in the format with some examples below:

file_name = os.path.join("Files", "tips.json")
file_path = os.path.join(x, file_name)

tips.to_json(file_path)

In [None]:
tips.head().to_dict()

In [None]:
tips.head().to_html()

In [None]:
file_name = os.path.join("Files", "tips.html")
file_path = os.path.join(x, file_name)

tips.to_html(file_path)

In [None]:
tips.head().to_latex()

In [None]:
file_name = os.path.join("Files", "tips.latex")
file_path = os.path.join(x, file_name)

tips.to_latex(file_path)

In [None]:
file_name = os.path.join("Files", "tips.tex")
file_path = os.path.join(x, file_name)

tips.to_latex(file_path)

In [None]:
tips.head().to_string()

In [None]:
file_name = os.path.join("Files", "tips.txt")
file_path = os.path.join(x, file_name)

tips.to_string(file_path)

In [None]:
# These methods are really useful and, for correctly formatted data, are a very convenient way to read data into pandas 
# and also export it from pandas.
# What we have seen in this chapter is some of the more advanced methods of pandas and how we can use them for complex data analysis. 
# We have shown how pandas allows us to manipulate data as if it were in a database, letting us join, merge, group, and pivot
# the data in a variety of ways.
# We have also covered some of the built-in methods that pandas has and shown how we can deal with missing data. 
# The examples that we have covered have been rather simple in nature, but pandas is powerful enough to deal with large datasets,
# and that makes it an extremely useful Python package. 
# It is also worth noting that pandas plays well with many other Python packages, meaning a mastery of it is essential for a 
# Python programmer.