# Data Load and Wrangle

## The goal of the exercise is to load a dataset, clean and transform it, and compute descriptive statistics

In [None]:
# Display the output of every expression in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

<div class="alert alert-block alert-warning">
    
## Task Instructions 
### Load one of the built-in datasets
### Transform the data as required
### From the perspective of a journalist interested in writing a story about the data, **_ask and answer_** **3** questions that summarize the sample.
    
#### Please *_only use summary statistics_* (mean, median, mode, standard deviation, variance, range, ...). The statistics can be computed by group.

#### Reference: https://kolesnikov.ga/Datasets_in_Python/

</div>

In [1]:
import statsmodels.api as sm 
copper = sm.datasets.copper
print(copper.DESCRLONG)

This data describes the world copper market from 1951 through 1975.  In an
example, in Gill, the outcome variable (of a 2 stage estimation) is the world
consumption of copper for the 25 years.  The explanatory variables are the
world consumption of copper in 1000 metric tons, the constant dollar adjusted
price of copper, the price of a substitute, aluminum, an index of real per
capita income base 1970, an annual measure of manufacturer inventory change,
and a time trend.



In [3]:
dataset_copper = copper.load_pandas()
df_copper = dataset_copper.data
df_copper.head()

Unnamed: 0,WORLDCONSUMPTION,COPPERPRICE,INCOMEINDEX,ALUMPRICE,INVENTORYINDEX,TIME
0,3173.0,26.56,0.7,19.76,0.98,1.0
1,3281.1,27.31,0.71,20.78,1.04,2.0
2,3135.7,32.95,0.72,22.55,1.05,3.0
3,3359.1,33.9,0.7,23.06,0.97,4.0
4,3755.1,42.7,0.74,24.93,1.02,5.0


In [4]:
sm.datasets.copper.load_pandas().data

Unnamed: 0,WORLDCONSUMPTION,COPPERPRICE,INCOMEINDEX,ALUMPRICE,INVENTORYINDEX,TIME
0,3173.0,26.56,0.7,19.76,0.98,1.0
1,3281.1,27.31,0.71,20.78,1.04,2.0
2,3135.7,32.95,0.72,22.55,1.05,3.0
3,3359.1,33.9,0.7,23.06,0.97,4.0
4,3755.1,42.7,0.74,24.93,1.02,5.0
5,3875.9,46.11,0.74,26.5,1.04,6.0
6,3905.7,31.7,0.74,27.24,0.98,7.0
7,3957.6,27.23,0.72,26.21,0.98,8.0
8,4279.1,32.89,0.75,26.09,1.03,9.0
9,4627.9,33.78,0.77,27.4,1.03,10.0


# Question 1: What is the average World Consumption of copper over the 25 years? How far do the ALUMPRICE values typically deviate from their mean?

In [5]:
df_copper.mean()

WORLDCONSUMPTION    5433.6320
COPPERPRICE           37.1684
INCOMEINDEX            0.8664
ALUMPRICE             24.2920
INVENTORYINDEX         1.0056
TIME                  13.0000
dtype: float64

# The mean World Consumption is 5433.632 (in thousands of metric tons).

In [6]:
df_copper.std()

WORLDCONSUMPTION    1669.628887
COPPERPRICE            6.911005
INCOMEINDEX            0.144363
ALUMPRICE              2.354434
INVENTORYINDEX         0.034651
TIME                   7.359801
dtype: float64

# The standard deviation of ALUMPRICE is 2.354434. It measures the typical (root-mean-square) distance of the values from their mean, which is close to, but not the same as, the average distance.
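
To make the difference between the standard deviation and the literal average distance from the mean (the mean absolute deviation) concrete, the sketch below computes both. It is only an illustration: so that it runs on its own, it uses the ALUMPRICE values of the ten rows displayed earlier (1951 to 1960), and the names `alum`, `std` and `mad` are made up for this example. With the full data, `df_copper["ALUMPRICE"]` can be used in place of `alum`.

```python
import pandas as pd

# ALUMPRICE for the ten rows displayed earlier in this notebook (1951-1960).
# With the full dataset, use df_copper["ALUMPRICE"] instead.
alum = pd.Series(
    [19.76, 20.78, 22.55, 23.06, 24.93, 26.5, 27.24, 26.21, 26.09, 27.4],
    name="ALUMPRICE",
)

mean = alum.mean()

# Sample standard deviation (pandas divides by n - 1 by default).
std = alum.std()

# Mean absolute deviation: the literal average distance from the mean.
mad = (alum - mean).abs().mean()

print(f"mean = {mean:.3f}")
print(f"standard deviation = {std:.3f}")
print(f"mean absolute deviation = {mad:.3f}")
```

The mean absolute deviation is never larger than the standard deviation, because squaring gives more weight to values far from the mean.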

# Question 2: Which year has the lowest World Consumption?

In [7]:
df_copper.min()

WORLDCONSUMPTION    3135.70
COPPERPRICE           26.56
INCOMEINDEX            0.70
ALUMPRICE             18.56
INVENTORYINDEX         0.94
TIME                   1.00
dtype: float64

# The minimum World Consumption is 3135.70, which occurs at TIME = 3. Since TIME = 1 corresponds to 1951, 1953 has the lowest World Consumption.
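
`min()` and `max()` return only the extreme values, so the year still has to be read off the table by eye. A possible way to look it up directly is `idxmin()` / `idxmax()`, sketched below. The sketch is built from the ten rows displayed earlier so that it runs on its own; the `YEAR` column and the names `sample`, `lowest` and `highest` are assumptions of this example, and `YEAR = 1950 + TIME` relies on TIME = 1 being 1951, as the dataset description suggests.

```python
import pandas as pd

# First ten rows of the copper data, as displayed earlier in this notebook.
# With the full dataset, start from df_copper instead of this sample.
sample = pd.DataFrame({
    "WORLDCONSUMPTION": [3173.0, 3281.1, 3135.7, 3359.1, 3755.1,
                         3875.9, 3905.7, 3957.6, 4279.1, 4627.9],
    "TIME": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# Assumption: TIME = 1 corresponds to 1951.
sample["YEAR"] = 1950 + sample["TIME"].astype(int)

# idxmin()/idxmax() return the row label of the extreme value,
# which .loc then uses to fetch the whole row.
lowest = sample.loc[sample["WORLDCONSUMPTION"].idxmin()]
highest = sample.loc[sample["WORLDCONSUMPTION"].idxmax()]

print("lowest: ", int(lowest["YEAR"]), lowest["WORLDCONSUMPTION"])
print("highest:", int(highest["YEAR"]), highest["WORLDCONSUMPTION"])
```

Because the sample stops at 1960, its highest value is not the overall maximum of 8480.30; running the same two lines on `df_copper` would give the year for the full 25 years.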

# Question 3: Which year has the highest aluminum price?

In [8]:
df_copper.max()

WORLDCONSUMPTION    8480.30
COPPERPRICE           52.27
INCOMEINDEX            1.12
ALUMPRICE             27.40
INVENTORYINDEX         1.08
TIME                  25.00
dtype: float64

# The maximum ALUMPRICE is 27.40, which occurs at TIME = 10. Therefore, 1960 has the highest aluminum price.
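
For a story, the few cheapest and most expensive years may be more useful than a single extreme. One way to list them is `nsmallest()` / `nlargest()`, sketched here on the ten rows displayed earlier so that the cell runs on its own. The `YEAR` column and the names `sample`, `cheapest` and `priciest` are assumptions of this example, and the year mapping again assumes TIME = 1 is 1951.

```python
import pandas as pd

# ALUMPRICE and TIME for the ten rows displayed earlier in this notebook.
# With the full dataset, start from df_copper instead of this sample.
sample = pd.DataFrame({
    "ALUMPRICE": [19.76, 20.78, 22.55, 23.06, 24.93,
                  26.5, 27.24, 26.21, 26.09, 27.4],
    "TIME": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# Assumption: TIME = 1 corresponds to 1951.
sample["YEAR"] = 1950 + sample["TIME"]

# Three lowest and three highest aluminum prices, with their years.
cheapest = sample.nsmallest(3, "ALUMPRICE")[["YEAR", "ALUMPRICE"]]
priciest = sample.nlargest(3, "ALUMPRICE")[["YEAR", "ALUMPRICE"]]

print(cheapest)
print(priciest)
```

The lowest price in this sample (19.76) is above the overall minimum of 18.56 reported by `df_copper.min()`, so the cheapest year of the full series lies after 1960 and only shows up when the same calls are run on `df_copper`.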

### Displaying the first 5 rows of the dataset

In [16]:
df_copper.head()

Unnamed: 0,WORLDCONSUMPTION,COPPERPRICE,INCOMEINDEX,ALUMPRICE,INVENTORYINDEX,TIME
0,3173.0,26.56,0.7,19.76,0.98,1.0
1,3281.1,27.31,0.71,20.78,1.04,2.0
2,3135.7,32.95,0.72,22.55,1.05,3.0
3,3359.1,33.9,0.7,23.06,0.97,4.0
4,3755.1,42.7,0.74,24.93,1.02,5.0


In [19]:
df_copper.shape

(25, 6)

# Displaying summary statistics for every column of the dataset

In [20]:
df_copper.describe()

Unnamed: 0,WORLDCONSUMPTION,COPPERPRICE,INCOMEINDEX,ALUMPRICE,INVENTORYINDEX,TIME
count,25.0,25.0,25.0,25.0,25.0,25.0
mean,5433.632,37.1684,0.8664,24.292,1.0056,13.0
std,1669.628887,6.911005,0.144363,2.354434,0.034651,7.359801
min,3135.7,26.56,0.7,18.56,0.94,1.0
25%,3905.7,32.38,0.74,22.75,0.98,7.0
50%,5327.9,36.24,0.83,24.98,1.0,13.0
75%,6974.3,42.7,1.0,26.01,1.03,19.0
max,8480.3,52.27,1.12,27.4,1.08,25.0
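
`describe()` summarizes the whole sample at once. Since the task also allows statistics by group, the sketch below shows one possible grouping: five-year periods. The grouping, the `PERIOD_START` column and the names `sample` and `by_period` are assumptions of this example, not part of the dataset. It uses the ten rows displayed earlier so that it runs on its own.

```python
import pandas as pd

# WORLDCONSUMPTION for the ten rows displayed earlier (1951-1960).
# With the full dataset, derive YEAR from df_copper["TIME"] instead.
sample = pd.DataFrame({
    "YEAR": list(range(1951, 1961)),
    "WORLDCONSUMPTION": [3173.0, 3281.1, 3135.7, 3359.1, 3755.1,
                         3875.9, 3905.7, 3957.6, 4279.1, 4627.9],
})

# Hypothetical grouping: label each year with the first year of its
# five-year period (1951-1955 -> 1951, 1956-1960 -> 1956).
sample["PERIOD_START"] = 1951 + ((sample["YEAR"] - 1951) // 5) * 5

# Summary statistics per period.
by_period = sample.groupby("PERIOD_START")["WORLDCONSUMPTION"].agg(
    ["mean", "median", "std", "min", "max"]
)

# The range is not a built-in aggregation, so derive it from min and max.
by_period["range"] = by_period["max"] - by_period["min"]

print(by_period)
```

Comparing the rows of `by_period` shows how the level and spread of consumption change from one period to the next, using summary statistics only.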


In [22]:
df_copper.columns


Index(['WORLDCONSUMPTION', 'COPPERPRICE', 'INCOMEINDEX', 'ALUMPRICE',
       'INVENTORYINDEX', 'TIME'],
      dtype='object')