# Calculate Basic Statistics in Pandas

In this section we will see how to calculate basic statistics of a data set with Pandas.

We will use the same data sets as in the previous notebooks.

In [1]:
import pandas as pd

df = pd.read_csv('./data/student_debt.csv')

Looking at basic statistics of a column, such as its minimum, maximum and average value, can already tell us a lot about the characteristics of the data.

Pandas makes it easy to retrieve the most common summary statistics for the numeric columns of a dataframe with the `.describe()` method.

In [2]:
df.describe()

| | Unnamed: 0 | No.people | Sum | Average | Median |
|---|---|---|---|---|---|
| count | 72.0 | 72.0 | 72.0 | 72.0 | 72.0 |
| mean | 37.5 | 402.4375 | 5.045833 | 10.547222 | 6.606944 |
| std | 20.92845 | 366.874162 | 5.049165 | 3.751093 | 2.435554 |
| min | 2.0 | 0.1 | 0.0 | 2.2 | 1.4 |
| 25% | 19.75 | 32.3 | 0.1 | 8.825 | 5.0 |
| 50% | 37.5 | 411.15 | 4.55 | 11.8 | 6.85 |
| 75% | 55.25 | 632.425 | 8.575 | 12.6 | 8.25 |
| max | 73.0 | 1413.7 | 19.3 | 16.5 | 10.9 |


Here we can see the count, mean (average), standard deviation, minimum and maximum of each numeric column, as well as the 25th, 50th and 75th percentiles. Notice that the `describe()` method returns a dataframe that you can interact with just like any other dataframe.
One difference is that the index of each row is the name of the statistic it displays.
We can use `.loc[]` to access all values of a given row (in this case, one type of statistic for all columns).

In [3]:
df.describe().loc['mean']  # get the mean of every numeric column

Unnamed: 0     37.500000
No.people     402.437500
Sum             5.045833
Average        10.547222
Median          6.606944
Name: mean, dtype: float64
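
Since the summary is an ordinary dataframe, `.loc[]` can also pick out a single cell or several rows at once. The sketch below illustrates this on a small made-up table; the name `toy` and its columns and values are hypothetical and are not taken from `student_debt.csv`.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})

summary = toy.describe()  # the summary is itself a dataframe

# One statistic for every column (returns a Series)
mean_row = summary.loc["mean"]

# One statistic for one column (returns a single number)
median_height = summary.loc["50%", "height_cm"]

# Several statistics at once (returns a smaller dataframe)
min_and_max = summary.loc[["min", "max"]]

print(mean_row)
print(median_height)
print(min_and_max)
```

Passing a row label and a column label together, as in `summary.loc["50%", "height_cm"]`, avoids having to select the whole row first.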

If we are curious about one specific statistic (e.g. we only want to know the average, the minimum or the maximum), Pandas has built-in methods for each of them.
These methods can be called on a DataFrame, which returns the statistic for each column separately, or on a Pandas Series, which returns a single number. The latter is very useful when calculating other metrics.

These methods are the following:
- `.min()`: returns minimum
- `.max()`: returns maximum
- `.mean()`: returns average (mean)
- `.std()`: returns the standard deviation (a statistical quantity measuring how 'spread out' the values of a column are)
- `.count()`: returns the number of values in a column (can be useful if we have missing values, as it only counts non-missing values)
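
To make the difference between calling a method on a DataFrame and on a Series concrete, here is a minimal sketch. It also shows how `.count()` behaves when a value is missing. The table `toy` and its columns are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in "score"
toy = pd.DataFrame({
    "score": [4.0, 8.0, np.nan, 6.0],
    "hours": [1.0, 2.0, 3.0, 4.0],
})

# Called on a DataFrame: one value per column (a Series)
per_column_mean = toy.mean()

# Called on a Series: a single number; missing values are skipped
score_mean = toy["score"].mean()

# len() counts every row, .count() only the non-missing values
n_rows = len(toy)
n_scores = toy["score"].count()
n_missing = n_rows - n_scores

print(per_column_mean)
print(score_mean, n_rows, n_scores, n_missing)
```

Comparing `len()` with `.count()` is a quick way to see how many values are missing in a column.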

In [4]:
df.min() # Minimum values of all columns

Unnamed: 0                        2
Period                         2011
Characteristic    65 years or older
No.people                       0.1
Sum                             0.0
Average                         2.2
Median                          1.4
dtype: object
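
Note that the result above also contains text columns: for text, the "minimum" is the value that comes first in alphabetical order, and the resulting Series has `dtype: object` because it mixes numbers and text. The sketch below reproduces this on a made-up table (the name `toy` and its contents are hypothetical) and shows the `numeric_only=True` argument, which restricts the result to numeric columns.

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column
toy = pd.DataFrame({
    "group": ["students", "adults", "seniors"],
    "amount": [3.5, 1.2, 2.8],
})

# Minimum of every column; text is compared alphabetically
all_min = toy.min()

# Minimum of the numeric columns only
numeric_min = toy.min(numeric_only=True)

print(all_min)
print(numeric_min)
```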

In [5]:
df['No.people'].max() # Maximum of a specific column

1413.7
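
Because a statistic of a single column is just a number, several of them can be combined into a new metric. As one possible illustration, the sketch below measures how many standard deviations the largest value lies above the mean. The Series `values` is made up, and this particular metric is only an example, not something the data set requires. Keep in mind that `.std()` in Pandas computes the sample standard deviation by default.

```python
import pandas as pd

# Hypothetical toy column
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

largest = values.max()
average = values.mean()
spread = values.std()  # sample standard deviation (the Pandas default)

# How many standard deviations the largest value is above the mean
max_in_std_units = (largest - average) / spread

print(round(max_in_std_units, 2))
```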

# Exercises

In these exercises we will use the `renewable_electricity.csv` dataset.

### Exercise 1

Calculate the minimum and maximum of _Gross production_ in the data.

In [6]:
#------------ Exercise code comes here ------------- #

### Exercise 2

In data analysis we are sometimes interested in the range of a column. The range is the difference between the largest and the smallest value in the column.

Calculate the range for _Capacity (megawatt)_ and _Installations in year_.

In [7]:
#------------ Exercise code comes here ------------- #

### Exercise 3

What was the average _Gross production_ throughout all years for the total wind energy? (hint: for total wind energy, the _Source_ column should say _wind-total_)

In [8]:
#------------ Exercise code comes here ------------- #

### Bonus exercise

This exercise really puts all your knowledge to the test, so don't worry if you cannot do it just yet :)

What was the maximum number of installations in each year? Hint: you should use a list or a dictionary to store your results. If you are not familiar with these, don't be afraid to ask one of the Data mentors; we are happy to help! :)

As a bonus, you could try to find out which _Source_ belongs to this number (i.e. which type of electrical power plant had the largest number of installations in each year).

In [9]:
#------------ Exercise code comes here ------------- #