# Spring Cleaning!

Harold's stock data is a mess! Help him clean up his data before the auditors arrive!

In [1]:
# Initial imports
import pandas as pd
from pathlib import Path

### Load CSV data into Pandas using `read_csv`

In [2]:
csv_path = Path("../Resources/stock_data.csv")
csv_data = pd.read_csv(csv_path)

### Identify the number of rows and columns (shape) in the DataFrame.

In [3]:
csv_data.shape

(504, 14)

### Preview the DataFrame using `head` to visually ensure data has been loaded in correctly.

In [4]:
csv_data.head()

| | symbol | name | sector | price | price_per_earnings | dividend_yield | earnings_per_share | 52_week_low | 52_week_high | market_cap | ebitda | price_per_sales | price_per_book | sec_filings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MMM | 3M Company | Industrials | $222.89 | 24.31 | 2.332862 | $7.92 | 259.77 | 175.49 | 138721100000.0 | 9048000000.0 | 4.390271 | 11.34 | http://www.sec.gov/cgi-bin/browse-edgar?action... |
| 1 | AOS | A.O. Smith Corp | Industrials | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | ABT | Abbott Laboratories | Health Care | 56.27 | 22.51 | 1.908982 | 0.26 | 64.6 | 42.28 | 102121000000.0 | 5744000000.0 | 3.74048 | 3.19 | http://www.sec.gov/cgi-bin/browse-edgar?action... |
| 3 | ABBV | AbbVie Inc. | Health Care | 108.48 | 19.41 | 2.49956 | 3.29 | 125.86 | 60.05 | 181386300000.0 | 10310000000.0 | 6.291571 | 26.14 | http://www.sec.gov/cgi-bin/browse-edgar?action... |
| 4 | ATVI | Activision Blizzard | Information Technology | 65.83 | NaN | 0.431903 | 1.28 | 74.945 | 38.93 | 52518670000.0 | 2704000000.0 | 10.59512 | 5.16 | http://www.sec.gov/cgi-bin/browse-edgar?action... |


### Identify the number of records in the DataFrame, and compare it with the number of rows in the original file.

In [5]:
csv_data.count()

symbol                504
name                  502
sector                501
price                 500
price_per_earnings    497
dividend_yield        499
earnings_per_share    498
52_week_low           500
52_week_high          500
market_cap            500
ebitda                492
price_per_sales       500
price_per_book        492
sec_filings           500
dtype: int64
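
`count()` reports non-null values per column, so on its own it does not say how many rows the original file holds. The sketch below shows one way to make that comparison. It uses a small in-memory CSV as a hypothetical stand-in for `stock_data.csv`, and the names `raw_text`, `sample_df`, `file_rows`, `loaded_rows`, and `non_null_counts` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv, kept in memory so the sketch runs on its own.
raw_text = (
    "symbol,name,price\n"
    "MMM,3M Company,$222.89\n"
    "AOS,A.O. Smith Corp,\n"
    "ABT,Abbott Laboratories,56.27\n"
)

sample_df = pd.read_csv(io.StringIO(raw_text))

# Data rows in the file: total lines minus the header line.
file_rows = len(raw_text.splitlines()) - 1

# Rows pandas loaded (nulls included) versus non-null values per column.
loaded_rows = len(sample_df)
non_null_counts = sample_df.count()

print(file_rows, loaded_rows)
print(non_null_counts)
```

Counting lines this way assumes no quoted field spans multiple lines. Any column whose `count()` falls below `loaded_rows` contains nulls.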

### Identify the percentage of null records in each column

In [6]:
csv_data.isnull().mean() * 100

symbol                0.000000
name                  0.396825
sector                0.595238
price                 0.793651
price_per_earnings    1.388889
dividend_yield        0.992063
earnings_per_share    1.190476
52_week_low           0.793651
52_week_high          0.793651
market_cap            0.793651
ebitda                2.380952
price_per_sales       0.793651
price_per_book        2.380952
sec_filings           0.793651
dtype: float64
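
`isnull().mean() * 100` works because the mean of a boolean column is the fraction of `True` values. As a rough illustration, the sketch below puts the raw null counts next to the percentages for a small DataFrame. The data and the names `sample_df` and `null_summary` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few missing values.
sample_df = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC", "DDD"],
    "price": [10.0, np.nan, 30.0, 40.0],
    "ebitda": [1.0e9, np.nan, np.nan, 4.0e9],
})

# isnull() yields booleans; sum() counts the True values, mean() gives their fraction.
null_summary = pd.DataFrame({
    "null_count": sample_df.isnull().sum(),
    "null_percent": sample_df.isnull().mean() * 100,
})

print(null_summary)
```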

### Drop Null Records

In [7]:
csv_data = csv_data.dropna().copy()

### Validate nulls have been dropped

In [8]:
csv_data.isnull().sum()

symbol                0
name                  0
sector                0
price                 0
price_per_earnings    0
dividend_yield        0
earnings_per_share    0
52_week_low           0
52_week_high          0
market_cap            0
ebitda                0
price_per_sales       0
price_per_book        0
sec_filings           0
dtype: int64

### Default null `ebitda` values to 0. Then, validate no records are null for `ebitda`.

In [9]:
csv_data["ebitda"] = csv_data["ebitda"].fillna(0)
csv_data["ebitda"].isnull().sum()

0
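
Because `dropna()` was called earlier with no arguments, every row with a null `ebitda` had already been removed before this cell ran, so the `fillna(0)` here has nothing left to fill. If the goal were to keep those rows, one option is to fill `ebitda` first and drop afterwards. The sketch below compares the two orders on hypothetical data; `sample_df`, `drop_then_fill`, and `fill_then_drop` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one row is missing price, another is missing ebitda.
sample_df = pd.DataFrame({
    "symbol": ["AAA", "BBB", "CCC"],
    "price": [10.0, np.nan, 30.0],
    "ebitda": [1.0e9, 2.0e9, np.nan],
})

# Order used in the notebook: drop every row with any null, then fill ebitda.
drop_then_fill = sample_df.dropna().copy()
drop_then_fill["ebitda"] = drop_then_fill["ebitda"].fillna(0)

# Alternative order: fill ebitda first, so only nulls in other columns cause a drop.
fill_then_drop = sample_df.copy()
fill_then_drop["ebitda"] = fill_then_drop["ebitda"].fillna(0)
fill_then_drop = fill_then_drop.dropna()

print(len(drop_then_fill), len(fill_then_drop))
```

Which order is appropriate depends on whether a zero `ebitda` is an acceptable default for the analysis at hand.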

### Drop Duplicates

In [10]:
csv_data = csv_data.drop_duplicates().copy()
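
The null-handling steps are each followed by a validation cell, and the same idea can be applied to duplicates. A minimal sketch, using hypothetical data and the illustrative names `sample_df`, `deduped_df`, `duplicates_before`, and `duplicates_after`:

```python
import pandas as pd

# Hypothetical sample containing one fully duplicated row.
sample_df = pd.DataFrame({
    "symbol": ["AAA", "BBB", "BBB", "CCC"],
    "price": [10.0, 20.0, 20.0, 30.0],
})

# duplicated() marks every repeat of an earlier row as True.
duplicates_before = sample_df.duplicated().sum()

deduped_df = sample_df.drop_duplicates().copy()
duplicates_after = deduped_df.duplicated().sum()

print(duplicates_before, duplicates_after)
```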

---

### Challenge

#### Preview `price` field using the `head` function.

In [11]:
csv_data["price"].head(10)

0     $222.89
2       56.27
3      108.48
5      108.48
6      185.16
7      109.63
10        178
11     179.11
14      152.8
15      62.49
Name: price, dtype: object

#### Clean `price` Series by replacing `$`

In [12]:
# regex=False makes pandas treat "$" as a literal character, not a regex anchor
csv_data["price"] = csv_data["price"].str.replace("$", "", regex=False)
csv_data["price"].head(10)

0     222.89
2      56.27
3     108.48
5     108.48
6     185.16
7     109.63
10       178
11    179.11
14     152.8
15     62.49
Name: price, dtype: object
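
`$` is also a regular-expression metacharacter (the end-of-string anchor), and the default value of the `regex` parameter of `Series.str.replace` has differed between pandas versions. Passing `regex=False` explicitly keeps the replacement literal regardless of version. A small sketch on hypothetical values, with `prices` and `cleaned` as illustrative names:

```python
import pandas as pd

# Hypothetical price strings, some carrying a currency symbol.
prices = pd.Series(["$222.89", "56.27", "108.48"])

# regex=False treats "$" as a plain character to remove.
cleaned = prices.str.replace("$", "", regex=False)

print(cleaned.tolist())
```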

#### Confirm data type of `price`

In [13]:
csv_data["price"].dtype

dtype('O')

#### Cast `price` Series as float and then validate using `dtype`

In [14]:
csv_data["price"] = csv_data["price"].astype("float")
csv_data["price"].dtype

dtype('float64')
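
`astype` raises a `ValueError` if any value cannot be parsed as a number. When a column might still hold stray text, `pd.to_numeric` with `errors="coerce"` is one alternative: it converts what it can and turns the rest into `NaN`, which can then be counted or inspected. The sketch below uses hypothetical values; `prices`, `numeric_prices`, and `unparsed_count` are illustrative names.

```python
import pandas as pd

# Hypothetical cleaned strings, with one value that is not a number.
prices = pd.Series(["222.89", "56.27", "n/a"])

# errors="coerce" converts unparseable entries to NaN instead of raising.
numeric_prices = pd.to_numeric(prices, errors="coerce")

# Count how many entries failed to parse.
unparsed_count = numeric_prices.isnull().sum()

print(numeric_prices.dtype, unparsed_count)
```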