## Pandas Continued

In [None]:
import pandas as pd

df = pd.read_csv("Glasgow_Weather.csv",index_col="day")
df.dropna(inplace=True)
df

You can filter the data in a dataframe using the comparison operators seen earlier in these Python tutorials.

In [None]:
df["tempMin"] > 5

As you can see, this has returned a Series of True and False values. A Series is a one-dimensional labelled array.

If tempMin is greater than 5, the entry for that index is True; if it is less than or equal to 5, the entry is False.

So how is this useful?

We can apply this filter to the dataframe itself to return a dataframe containing only the rows where the filter value is True.

In [None]:
df[df["tempMin"] > 5]

As you can see the values of tempMin are all above 5.
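To make the two steps explicit, here is a minimal sketch on a tiny made-up dataframe. The values and the names `toy`, `mask` and `warm` are assumptions for illustration only and are not taken from Glasgow_Weather.csv.

```python
import pandas as pd

# Made-up values for illustration; not from Glasgow_Weather.csv.
toy = pd.DataFrame(
    {"tempMin": [3, 5, 8, 11]},
    index=pd.Index(["d1", "d2", "d3", "d4"], name="day"),
)

mask = toy["tempMin"] > 5   # step 1: a boolean Series, one value per row
warm = toy[mask]            # step 2: keep only the rows where mask is True

print(mask.tolist())        # [False, False, True, True]
print(warm.index.tolist())  # ['d3', 'd4']
```

Note that the row with tempMin equal to 5 is dropped, because `>` is a strict comparison.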

Below are examples using the other comparison operators you have seen before.

In [None]:
df["tempMin"] >= 5

In [None]:
df[df["tempMin"] >= 5]

In [None]:
df["tempMin"] < 5

In [None]:
df[df["tempMin"] <= 5]

In [None]:
df["summary"] == "Clear throughout the day."

In [None]:
df[df["summary"] == "Clear throughout the day."]

In [None]:
df["summary"] != "Clear throughout the day."

In [None]:
df[df["summary"] != "Clear throughout the day."]

You can create your own complex filters by combining conditions with the & (AND) and | (OR) operators.

These boolean filters are sometimes called masks.

In [None]:
filter_tempMin = df["tempMin"] > 10
filter_tempMax = df["tempMax"] < 12
filter_desc = df["desc"] == "partly-cloudy-day"


filter_mask1 = filter_tempMax & filter_tempMin
filter_mask1

As you can see, filter_mask1 is a Series of True and False values.

In [None]:
df[filter_mask1]

In [None]:
filter_mask2 = (filter_tempMax & filter_tempMin) | filter_desc
filter_mask2

In [None]:
df[filter_mask2]
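If you write the conditions inline instead of storing each one in a variable first, every comparison needs its own parentheses, because in Python & and | bind more tightly than > and <. The sketch below uses a small made-up dataframe; `toy`, `both` and `either` are hypothetical names.

```python
import pandas as pd

# Made-up values for illustration; not from Glasgow_Weather.csv.
toy = pd.DataFrame(
    {
        "tempMin": [4, 11, 11, 6],
        "tempMax": [9, 11.5, 15, 10],
        "desc": ["rain", "rain", "clear-day", "partly-cloudy-day"],
    }
)

# Parentheses around each comparison are required when combining inline.
both = (toy["tempMin"] > 10) & (toy["tempMax"] < 12)
either = both | (toy["desc"] == "partly-cloudy-day")

print(toy[both].index.tolist())    # [1]
print(toy[either].index.tolist())  # [1, 3]
```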

If you want the inverse of a mask, put the tilde character (~) before it, as with filter_mask2 below.

In [None]:
df[~filter_mask2]

In [None]:
~(df["tempMin"] > 5)

In [None]:
df[~(df["tempMin"] > 5)]
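As a quick check of what the tilde does, the sketch below (made-up values, hypothetical names) shows that inverting `> 5` gives the same mask as `<= 5` when there are no missing values, and that the Python keyword `not` cannot be used on a Series in its place.

```python
import pandas as pd

# Made-up values for illustration; no missing values.
toy = pd.DataFrame({"tempMin": [3, 5, 8, 11]})

mask = toy["tempMin"] > 5
inverse = ~mask                      # flips every True/False value

print(inverse.tolist())              # [True, True, False, False]
same = inverse.equals(toy["tempMin"] <= 5)
print(same)                          # True

# `not mask` does not work element by element: pandas raises ValueError.
try:
    not mask
    raised = False
except ValueError:
    raised = True
print(raised)                        # True
```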

What you learned last week can be used with these filters as well.

In [None]:
df.loc[filter_mask1,"desc":"windSpeed"]

In [None]:
df.loc[~filter_mask1,"desc":"windSpeed"]
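The sketch below illustrates how the row mask and the column slice work together in loc. The column order of the toy dataframe is an assumption made for this example; the real file may order its columns differently. Label slices in loc include both end labels.

```python
import pandas as pd

# Made-up values and column order, for illustration only.
toy = pd.DataFrame(
    {
        "tempMin": [3, 8, 11],
        "desc": ["rain", "clear-day", "rain"],
        "windSpeed": [4.2, 1.5, 6.0],
        "cloudCover": [0.9, 0.1, 0.7],
    }
)

# Rows: where the mask is True. Columns: from "desc" up to and including "windSpeed".
subset = toy.loc[toy["tempMin"] > 5, "desc":"windSpeed"]

print(subset.columns.tolist())  # ['desc', 'windSpeed']
print(subset.index.tolist())    # [1, 2]
```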

We can also filter against a list of values, but this requires the isin method (as in "is in").

In [None]:
descriptions = ["rain","clear-day"]

filter_desc = df["desc"].isin(descriptions)
filter_desc

In [None]:
df[filter_desc]
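One way to think about isin is as shorthand for several == comparisons joined with |. The sketch below checks this on a small made-up Series; `desc`, `wanted`, `by_isin` and `by_or` are hypothetical names.

```python
import pandas as pd

# Made-up values for illustration.
desc = pd.Series(["rain", "clear-day", "fog", "rain"])
wanted = ["rain", "clear-day"]

by_isin = desc.isin(wanted)
by_or = (desc == "rain") | (desc == "clear-day")

print(by_isin.tolist())       # [True, True, False, True]
print(by_isin.equals(by_or))  # True
```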

We can also filter string columns by whether they contain a particular word.

In [None]:
filter_particular_word = df["summary"].str.contains("Possible", na=False)
filter_particular_word

In [None]:
df[filter_particular_word]
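Two details of str.contains are easy to miss: the match is case sensitive by default, and na=False makes missing values count as "no match". The sketch below shows both on a made-up Series; the strings are invented for illustration.

```python
import pandas as pd

# Made-up summaries, including one missing value.
summary = pd.Series(
    [
        "Possible drizzle in the morning.",
        "Clear throughout the day.",
        None,
        "possible light rain.",
    ]
)

exact_case = summary.str.contains("Possible", na=False)
any_case = summary.str.contains("possible", case=False, na=False)

print(exact_case.tolist())  # [True, False, False, False]
print(any_case.tolist())    # [True, False, False, True]
```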

A quicker way to find all the rows between two values is the between method. The same result could be achieved with a complex filter combining a greater-than-or-equal and a less-than-or-equal comparison, but between is shorter to write and easier to read. By default, both end values are included.

In [None]:
df["tempMin"].between(5,10)

In [None]:
df[df["tempMin"].between(-2,5)]
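The sketch below compares between with the equivalent hand-written filter on a made-up Series, and shows that with the default settings both end values count as "between".

```python
import pandas as pd

# Made-up values for illustration.
temps = pd.Series([4, 5, 7, 10, 12])

by_between = temps.between(5, 10)
by_hand = (temps >= 5) & (temps <= 10)

print(by_between.tolist())         # [False, True, True, True, False]
print(by_between.equals(by_hand))  # True
```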

Now that you can filter data, let's see how we can sort it into ascending and descending order.

In [None]:
df.sort_values(by="tempMin")

In [None]:
df.sort_values("tempMin")

As you can see, the values of tempMin have been sorted in ascending order. If you want descending order, pass False to the ascending parameter.

In [None]:
df.sort_values(by="tempMin", ascending=False)
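It is worth knowing that sort_values returns a new, sorted dataframe and leaves the original untouched unless you assign the result back. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values for illustration.
toy = pd.DataFrame(
    {"tempMin": [8, 3, 5]},
    index=pd.Index(["d1", "d2", "d3"], name="day"),
)

ordered = toy.sort_values(by="tempMin", ascending=False)

print(ordered["tempMin"].tolist())  # [8, 5, 3]
print(toy["tempMin"].tolist())      # [8, 3, 5]  (original order kept)
```

To keep the sorted version, assign it to a variable, as done with `ordered` here.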

How do we sort the index?

In [None]:
df.sort_value("day")

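As you can see, an AttributeError is raised: sort_value is not a dataframe method (the method for sorting by column values is spelled sort_values). The index has its own sorting method, sort_index, shown below.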

In [None]:
df.sort_index()

In [None]:
df.sort_index(ascending=False)
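The sketch below shows sort_index reordering rows by their index labels. The date-like strings used for the index are an assumption for illustration; the real "day" values may be formatted differently.

```python
import pandas as pd

# Made-up values; the index labels are hypothetical date strings.
toy = pd.DataFrame(
    {"tempMin": [8, 3, 5]},
    index=pd.Index(["2019-01-03", "2019-01-01", "2019-01-02"], name="day"),
)

by_day = toy.sort_index()
by_day_desc = toy.sort_index(ascending=False)

print(by_day.index.tolist())        # ['2019-01-01', '2019-01-02', '2019-01-03']
print(by_day["tempMin"].tolist())   # [3, 5, 8]
print(by_day_desc.index.tolist())   # ['2019-01-03', '2019-01-02', '2019-01-01']
```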

You can also sort on multiple columns; the order in which you pass the column names dictates the hierarchy of the sort.

In [None]:
df.sort_values(by=["desc","summary"])

In [None]:
df.sort_values(by=["summary","desc"])

As you can see, the order of the names in the list determines which column is sorted first.

You can also choose ascending or descending order for each column by passing a list to the ascending parameter.

In [None]:
df.sort_values(by=["summary","desc"], ascending=[True,False])

In [None]:
df.sort_values(by=["summary","desc"], ascending=[False,True])
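To see how the two lists line up, the sketch below sorts a small made-up dataframe by summary (ascending) and then, within each summary, by desc (descending). The values are invented and shortened for readability.

```python
import pandas as pd

# Made-up values for illustration.
toy = pd.DataFrame(
    {
        "summary": ["Overcast", "Clear", "Clear", "Overcast"],
        "desc": ["rain", "rain", "fog", "fog"],
    }
)

# First entry of `ascending` applies to "summary", second to "desc".
out = toy.sort_values(by=["summary", "desc"], ascending=[True, False])

print(out["summary"].tolist())  # ['Clear', 'Clear', 'Overcast', 'Overcast']
print(out["desc"].tolist())     # ['rain', 'fog', 'rain', 'fog']
print(out.index.tolist())       # [1, 2, 0, 3]
```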

Finally, you can find the rows with the smallest and largest values using the nsmallest and nlargest methods, passing the number of rows you want and the column name.

In [None]:
df.nsmallest(10, "tempMax")

In [None]:
df.nlargest(20,"cloudCover")
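A useful way to read nsmallest is as a shortcut for sorting and then taking the first few rows. The sketch below checks this on made-up values with no ties; `toy`, `smallest`, `via_sort` and `largest` are hypothetical names.

```python
import pandas as pd

# Made-up values for illustration; no repeated values.
toy = pd.DataFrame({"tempMax": [9, 14, 7, 11]})

smallest = toy.nsmallest(2, "tempMax")
via_sort = toy.sort_values("tempMax").head(2)
largest = toy.nlargest(2, "tempMax")

print(smallest["tempMax"].tolist())  # [7, 9]
print(smallest.equals(via_sort))     # True
print(largest["tempMax"].tolist())   # [14, 11]
```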