# Data Preparation (Milk Production in Ireland)

#### This section studies milk production and utilisation (in million litres) on farms in Ireland.

#### <font color=red>*Here, the data will be cleaned by removing unnecessary columns.*</font>

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('Ireland_Milk_Production.csv')

In [3]:
df

Unnamed: 0,Month,Year,Amount,Unit
0,January,2020,176.2,million litres
1,February,2020,331.7,million litres
2,March,2020,725.7,million litres
3,April,2020,982.7,million litres
4,May,2020,1115.3,million litres
5,June,2020,1031.0,million litres
6,July,2020,985.0,million litres
7,August,2020,867.6,million litres
8,September,2020,725.4,million litres
9,October,2020,646.8,million litres


<font color=red>The 'Unit' column is removed so that it is not carried along when the data is pivoted and melted, which keeps the reshaped data tidy.</font>
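Before dropping the column, it may be worth confirming that 'Unit' really carries no information, i.e. that every row uses the same unit. The following is a minimal sketch on a small sample built from the first rows shown above; the names `sample`, `unit_is_constant` and `cleaned` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Small sample taken from the first rows of the table above (illustrative only)
sample = pd.DataFrame({
    "Month": ["January", "February", "March"],
    "Year": [2020, 2020, 2020],
    "Amount": [176.2, 331.7, 725.7],
    "Unit": ["million litres"] * 3,
})

# A column with a single distinct value adds nothing to the analysis
unit_is_constant = sample["Unit"].nunique() == 1

# Drop the column only when it is safe to do so
cleaned = sample.drop(columns=["Unit"]) if unit_is_constant else sample
print(cleaned.columns.tolist())
```

If the check failed, the amounts would first need converting to a common unit rather than simply dropping the column.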

In [4]:
to_drop = ['Unit']
df.drop(to_drop, inplace=True, axis=1)

In [5]:
df.shape

(34, 3)

In [6]:
df

Unnamed: 0,Month,Year,Amount
0,January,2020,176.2
1,February,2020,331.7
2,March,2020,725.7
3,April,2020,982.7
4,May,2020,1115.3
5,June,2020,1031.0
6,July,2020,985.0
7,August,2020,867.6
8,September,2020,725.4
9,October,2020,646.8


In [7]:
df.describe(include=object)

Unnamed: 0,Month
count,34
unique,12
top,January
freq,3


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Month   34 non-null     object 
 1   Year    34 non-null     int64  
 2   Amount  34 non-null     float64
dtypes: float64(1), int64(1), object(1)
memory usage: 944.0+ bytes


### <font color=blue>**Melting and Pivoting the dataset**</font>

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%config InlineBackend.figure_format = "retina"
sns.set_context("talk")

import warnings
warnings.filterwarnings("ignore")

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'



In [10]:
df = df.pivot(index="Year", columns="Month", values="Amount")
df

Year,April,August,December,February,January,July,June,March,May,November,October,September
2020,982.7,867.6,258.8,331.7,176.2,985.0,1031.0,725.7,1115.3,449.8,646.8,725.4
2021,1060.5,917.4,258.5,352.4,182.2,1017.3,1067.3,829.7,1181.0,465.9,649.6,776.7
2022,1054.6,919.0,,367.5,183.3,1022.8,1057.6,808.6,1166.0,,698.5,785.6


In [11]:
print(df.isnull().values.any())

True


In [12]:
print(df.isnull().sum())

Month
April        0
August       0
December     1
February     0
January      0
July         0
June         0
March        0
May          0
November     1
October      0
September    0
dtype: int64


<font color=red>There are 2 null values in the dataset (November and December 2022) because the data for those months is not yet available. These null values will be filled by linear interpolation from the available data.</font>
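The gaps appear at the pivot step: a (Year, Month) combination with no row in the long table becomes `NaN` in the wide table. The sketch below illustrates this on a reduced copy of the October to December figures shown above and lists the missing cells; the names `long_df`, `wide` and `missing` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Reduced long-format table: no rows exist for November or December 2022
long_df = pd.DataFrame({
    "Month": ["October", "November", "December",
              "October", "November", "December",
              "October"],
    "Year": [2020, 2020, 2020, 2021, 2021, 2021, 2022],
    "Amount": [646.8, 449.8, 258.8, 649.6, 465.9, 258.5, 698.5],
})

# Pivot to wide format; absent combinations are filled with NaN
wide = long_df.pivot(index="Year", columns="Month", values="Amount")

# Collect the (Year, Month) labels of every missing cell
rows, cols = np.where(wide.isnull().to_numpy())
missing = [(int(wide.index[r]), str(wide.columns[c])) for r, c in zip(rows, cols)]
print(missing)
```

Listing the labels this way makes it easy to confirm that only the most recent months are affected before choosing how to fill them.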


In [13]:
# Linearly interpolate missing values down each month column, filling forward only
df.interpolate(method='linear', limit_direction='forward', inplace=True)

In [14]:
df.head()

Year,April,August,December,February,January,July,June,March,May,November,October,September
2020,982.7,867.6,258.8,331.7,176.2,985.0,1031.0,725.7,1115.3,449.8,646.8,725.4
2021,1060.5,917.4,258.5,352.4,182.2,1017.3,1067.3,829.7,1181.0,465.9,649.6,776.7
2022,1054.6,919.0,258.5,367.5,183.3,1022.8,1057.6,808.6,1166.0,465.9,698.5,785.6


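It can be observed that the interpolated values for November and December 2022 are the same as the 2021 values. Interpolation runs down each column (across years), and because 2022 is the last row there is no later value to interpolate towards, so pandas carries the last valid value forward.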
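This behaviour can be reproduced in isolation. The sketch below uses the December figures from the table above as a standalone Series (the names `december` and `filled` are illustrative) and shows that a trailing gap is filled with the last valid value rather than extrapolated.

```python
import numpy as np
import pandas as pd

# December amounts by year; 2022 is missing and is the last entry
december = pd.Series([258.8, 258.5, np.nan], index=[2020, 2021, 2022])

# With no later point to interpolate towards, the last valid value is repeated
filled = december.interpolate(method="linear")
print(filled)
```

In effect, the fill for these two months is a forward fill of the 2021 figures, which is worth keeping in mind when comparing 2022 with earlier years.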

In [15]:
# Resetting the index so that 'Year' becomes a regular column
df1 = df.reset_index()
df1

Month,Year,April,August,December,February,January,July,June,March,May,November,October,September
0,2020,982.7,867.6,258.8,331.7,176.2,985.0,1031.0,725.7,1115.3,449.8,646.8,725.4
1,2021,1060.5,917.4,258.5,352.4,182.2,1017.3,1067.3,829.7,1181.0,465.9,649.6,776.7
2,2022,1054.6,919.0,258.5,367.5,183.3,1022.8,1057.6,808.6,1166.0,465.9,698.5,785.6


In [16]:
df1.to_excel('Ireland_Milk_pivot.xlsx', index = False)

In [17]:
melted_ie = df1.melt(id_vars='Year',
                     var_name='Month',
                     value_name='Amount')

In [18]:
melted_ie

Unnamed: 0,Year,Month,Amount
0,2020,April,982.7
1,2021,April,1060.5
2,2022,April,1054.6
3,2020,August,867.6
4,2021,August,917.4
5,2022,August,919.0
6,2020,December,258.8
7,2021,December,258.5
8,2022,December,258.5
9,2020,February,331.7


In [19]:
melted_ie.to_excel('Ireland_Milk_melted.xlsx', index = False)
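The melted table lists the months alphabetically (April, August, December, ...) because `pivot` sorts the column labels. If calendar order is preferred, one option is to turn 'Month' into an ordered categorical before sorting. This is a minimal sketch on a few of the values above; `melted`, `month_order` and `ordered` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# A few rows in the alphabetical order produced by melting
melted = pd.DataFrame({
    "Year": [2020, 2021, 2020, 2021],
    "Month": ["April", "April", "February", "February"],
    "Amount": [982.7, 1060.5, 331.7, 352.4],
})

# Explicit calendar order for the month labels
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# An ordered categorical sorts by calendar position instead of alphabetically
melted["Month"] = pd.Categorical(melted["Month"], categories=month_order, ordered=True)
ordered = melted.sort_values(["Year", "Month"]).reset_index(drop=True)
print(ordered)
```

Sorting by year and then month restores the chronological sequence of the original file.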