# Data Wrangling

Perform the following:

1. With the 'faang' dataset, use type conversion to change the date column into a datetime and the volume column into integers. Then, sort by date and ticker.

2. Find the seven rows with the highest value for volume.

3. Right now the data is somewhere between long and wide format. Use melt() to make it a completely long format. Hint: date and ticker are our ID variables (they uniquely identify each row). We need to melt the rest so that we don't have separate columns for open, high, low, close, and volume.

4. Suppose we found out there was a glitch in how the data was recorded on July 26, 2018. How should we handle this? Note that there is no coding required for this exercise.

Submit your notebook in PDF format only.

## Imports

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
df = pd.read_csv('Datasets/faang.csv', encoding='ISO-8859-1')

In [2]:
df.dtypes

ticker     object
date       object
open      float64
high      float64
low       float64
close     float64
volume      int64
dtype: object

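#### 1. Convert types, then sort by date and ticker

In [3]:
df['date'] = pd.to_datetime(df['date'])  # object (string) -> datetime64
df['volume'] = df['volume'].astype(int)  # already int64 here; kept to make the conversion explicit
df = df.sort_values(by=['date', 'ticker'])
df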

| | ticker | date | open | high | low | close | volume |
|---|---|---|---|---|---|---|---|
| 251 | AAPL | 2018-01-02 | 166.9271 | 169.0264 | 166.0442 | 168.9872 | 25555934 |
| 502 | AMZN | 2018-01-02 | 1172.0000 | 1190.0000 | 1170.5100 | 1189.0100 | 2694494 |
| 0 | FB | 2018-01-02 | 177.6800 | 181.5800 | 177.5500 | 181.4200 | 18151903 |
| 1004 | GOOG | 2018-01-02 | 1048.3400 | 1066.9400 | 1045.2300 | 1065.0000 | 1237564 |
| 753 | NFLX | 2018-01-02 | 196.1000 | 201.6500 | 195.4200 | 201.0700 | 10966889 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | AAPL | 2018-12-31 | 157.8529 | 158.6794 | 155.8117 | 157.0663 | 35003466 |
| 752 | AMZN | 2018-12-31 | 1510.8000 | 1520.7600 | 1487.0000 | 1501.9700 | 6954507 |
| 250 | FB | 2018-12-31 | 134.4500 | 134.6400 | 129.9500 | 131.0900 | 24625308 |
| 1254 | GOOG | 2018-12-31 | 1050.9600 | 1052.7000 | 1023.5900 | 1035.6100 | 1493722 |
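
The type conversion asked for in exercise 1 can be checked on a small frame. The sketch below uses a hypothetical `toy` DataFrame (made-up values, not the faang data) in which both columns start out as strings, and confirms the dtypes and the sort order after conversion.

```python
import pandas as pd

# Hypothetical toy frame: dates and volumes stored as strings,
# as they might arrive from a loosely typed CSV.
toy = pd.DataFrame({
    'ticker': ['FB', 'AAPL', 'FB', 'AAPL'],
    'date': ['2018-01-03', '2018-01-03', '2018-01-02', '2018-01-02'],
    'volume': ['300', '100', '400', '200'],
})

# Convert both columns, then sort by date first and ticker second.
converted = toy.assign(
    date=pd.to_datetime(toy['date']),
    volume=toy['volume'].astype(int),
).sort_values(['date', 'ticker'])

print(converted.dtypes)  # date should now be datetime64, volume an integer type
print(converted)
```

Calling `.dtypes` again after the conversion is a quick way to confirm that the change took effect.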


#### 2. Finding the seven rows with the highest value for volume.

In [4]:
top_volume = df.nlargest(7, 'volume')
top_volume

| | ticker | date | open | high | low | close | volume |
|---|---|---|---|---|---|---|---|
| 142 | FB | 2018-07-26 | 174.89 | 180.13 | 173.75 | 176.26 | 169803668 |
| 53 | FB | 2018-03-20 | 167.47 | 170.2 | 161.95 | 168.15 | 129851768 |
| 57 | FB | 2018-03-26 | 160.82 | 161.1 | 149.02 | 160.06 | 126116634 |
| 54 | FB | 2018-03-21 | 164.8 | 173.4 | 163.3 | 169.39 | 106598834 |
| 433 | AAPL | 2018-09-21 | 219.0727 | 219.6482 | 215.6097 | 215.9768 | 96246748 |
| 496 | AAPL | 2018-12-21 | 156.1901 | 157.4845 | 148.9909 | 150.0862 | 95744384 |
| 463 | AAPL | 2018-11-02 | 207.9295 | 211.9978 | 203.8414 | 205.8755 | 91328654 |
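
As a cross-check, `nlargest` should give the same rows as sorting in descending order and taking the head. The sketch below shows this on a hypothetical `toy` frame with made-up, distinct volumes. With tied values the two approaches may order rows differently, so the comparison is only meant for the no-ties case.

```python
import pandas as pd

# Hypothetical toy frame with distinct volumes (no ties).
toy = pd.DataFrame({
    'ticker': ['FB', 'AAPL', 'AMZN', 'NFLX', 'GOOG'],
    'volume': [500, 300, 100, 400, 200],
})

# Two ways to get the three rows with the highest volume.
top3 = toy.nlargest(3, 'volume')
top3_sorted = toy.sort_values('volume', ascending=False).head(3)

print(top3)
print(top3_sorted)
```

`nlargest` states the intent more directly and avoids sorting the whole frame when only a few rows are needed.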


#### 3. Melting the data

In [5]:
melted_df = pd.melt(df, id_vars=['date', 'ticker'], value_vars=['open', 'high', 'low', 'close', 'volume'])
melted_df.head()

| | date | ticker | variable | value |
|---|---|---|---|---|
| 0 | 2018-01-02 | AAPL | open | 166.9271 |
| 1 | 2018-01-02 | AMZN | open | 1172.0 |
| 2 | 2018-01-02 | FB | open | 177.68 |
| 3 | 2018-01-02 | GOOG | open | 1048.34 |
| 4 | 2018-01-02 | NFLX | open | 196.1 |
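
One way to confirm that melting loses no information is to reshape the long data back to wide and compare. The sketch below does this on a hypothetical two-row `toy` frame that reuses the AAPL and FB open and close prices for 2018-01-02 shown earlier. It relies on `date` and `ticker` uniquely identifying each row, as the exercise hint states.

```python
import pandas as pd

# Two rows taken from the 2018-01-02 output above, with only two value columns.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-02', '2018-01-02']),
    'ticker': ['AAPL', 'FB'],
    'open': [166.9271, 177.68],
    'close': [168.9872, 181.42],
})

# With value_vars omitted, every column that is not an ID variable is melted.
long_df = toy.melt(id_vars=['date', 'ticker'])

# Reverse the melt: one column per variable again.
wide_again = (
    long_df
    .set_index(['date', 'ticker', 'variable'])['value']
    .unstack('variable')
    .reset_index()
)

print(long_df)
print(wide_again)
```

If the ID variables did not uniquely identify rows, the `unstack` step would fail with a duplicate-index error, which is a useful signal that the chosen `id_vars` are incomplete.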


#### 4. Suppose we found out there was a glitch in how the data was recorded on July 26, 2018. How should we handle this?

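**Answer:** The glitch is tied to a known date, so the date values themselves are not missing, and searching the date column for NaN would not reveal the problem. A reasonable first step is to pull out every row for 2018-07-26 and check which tickers and which columns (prices, volume, or both) were affected. The FB row for that date has the highest volume in the data set (see exercise 2), which makes it worth a close look, although an extreme value on its own does not prove the record is wrong. If a trustworthy source is available, the affected values can be re-collected and replaced. If not, the options are to drop that day's rows, or to keep them with the affected values set to NaN or flagged as suspect, so that later analysis can exclude or interpolate them. Whichever option is chosen should be documented so that anyone using the data knows it was changed.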
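
Although the exercise requires no code, the handling options can be sketched briefly. The example below uses a hypothetical `toy` frame with made-up volumes. The names `glitch_day`, `is_glitch`, and `suspect` are illustrative, and which option is appropriate depends on what the glitch actually affected.

```python
import pandas as pd

# Hypothetical toy frame; the volumes are made up for illustration.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2018-07-25', '2018-07-26', '2018-07-27']),
    'ticker': ['FB', 'FB', 'FB'],
    'volume': [100.0, 999.0, 120.0],
})

glitch_day = pd.Timestamp('2018-07-26')
is_glitch = toy['date'] == glitch_day

# Step 1: inspect the rows recorded on the affected date.
suspect_rows = toy[is_glitch]

# Option A: drop the affected rows entirely.
dropped = toy[~is_glitch]

# Option B: keep the rows, blank out the affected values, and flag them.
flagged = toy.assign(
    suspect=is_glitch,
    volume=toy['volume'].mask(is_glitch),  # NaN where is_glitch is True
)

print(suspect_rows)
print(dropped)
print(flagged)
```

Option B keeps a record that the day existed, which matters for time series where a missing trading day could be misread as a market holiday.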