# Data Cleaning with Pandas - brands.csv

In [1]:
import pandas as pd

In [4]:
#brands
url = "https://drive.google.com/file/d/1m1ThDDIYRTTii-rqM5SEQjJ8McidJskD/view?usp=sharing" 
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
brands = pd.read_csv(path)
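
The cell above turns a Google Drive share link into a direct-download link by pulling the file ID out of the URL. The same step can be wrapped in a small helper, sketched below. The function name `drive_download_url` is made up for illustration, and the sketch assumes the share link always has the `/file/d/<ID>/view` shape shown above.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive 'view' share link into a direct-download link."""
    # Share links look like .../file/d/<FILE_ID>/view?usp=sharing,
    # so the file ID is the second-to-last piece when splitting on "/".
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


url = "https://drive.google.com/file/d/1m1ThDDIYRTTii-rqM5SEQjJ8McidJskD/view?usp=sharing"
path = drive_download_url(url)
print(path)
```

The resulting `path` can be passed to `pd.read_csv` exactly as in the cell above.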

## 1. Copy the Findings from Data Exploration of brands.csv:

### A)&nbsp; Check .info()
`.shape`, `.info()`, `.duplicated().sum()` and `.isna().sum()`

**INFO:**
- 187 rows and two columns: `short` and `long`
- `short` has 187 different values and no duplicates
- `long` has 181 different brands, so 6 entries repeat an earlier name
- There are no missing values
- Both columns have dtype `object`, which is fine for text
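
The checks behind these findings can be sketched on a small stand-in table. The frame `toy` below is invented for illustration (only the column names `short` and `long` come from the real file), so the numbers it prints are not the ones listed above.

```python
import pandas as pd

# Hypothetical stand-in for brands.csv; the real file has 187 rows.
toy = pd.DataFrame({
    "short": ["AAA", "AAB", "BBB", "BBC", "CCC"],
    "long": ["Apple", "Apple", "Bose", "Bose", "Made-Up Brand"],
})

n_rows, n_cols = toy.shape            # .shape is an attribute, not a method
full_dupes = toy.duplicated().sum()   # rows identical in every column
missing = toy.isna().sum()            # missing values per column
distinct = toy.nunique()              # distinct values per column

toy.info()                            # dtypes and non-null counts
print(n_rows, n_cols, full_dupes)
print(missing)
print(distinct)
```

Comparing `distinct` with the row count is what reveals repeated names in `long` even when `duplicated()` on whole rows reports zero.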

### B)&nbsp; Raw Data

**INFO:**
- `short` is a 3-character acronym, which can contain digits (e.g. 8MO)
- The values in `long` are not unique: for example, Apple and Bose each appear twice
- The second entry for Bose has the acronym CAD, which does not resemble the brand name
- There are blanks and dashes in the long names
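
One way to surface these observations in code is sketched below, again on an invented stand-in frame. The acronyms and the name "Made-Up Brand" are placeholders, not values from the real file.

```python
import pandas as pd

# Hypothetical stand-in for brands.csv.
toy = pd.DataFrame({
    "short": ["AAA", "AAB", "BBB", "BBC", "CCC"],
    "long": ["Apple", "Apple", "Bose", "Bose", "Made-Up Brand"],
})

# keep=False marks every member of a repeated group, not only the later copies,
# so each repeated name is shown next to all of its acronyms.
repeated = toy[toy["long"].duplicated(keep=False)].sort_values("long")

# Flag long names that contain a blank or a dash.
has_blank_or_dash = toy["long"].str.contains(r"[ -]", regex=True)

print(repeated)
print(toy[has_blank_or_dash])
```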

### C) Numerical Columns

**INFO:**
- There are no numeric columns in this dataset.
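
A quick way to confirm this is `select_dtypes`, sketched here on a two-row stand-in frame with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for brands.csv: two text columns only.
toy = pd.DataFrame({"short": ["AAA", "BBB"], "long": ["Apple", "Bose"]})

# Keep only numeric columns; an empty list means there are none to analyse.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```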



### D) Categorical Columns

**INFO:**
- So far there is no categorical variable, such as a status column or anything else with only a few groups.

## 2.&nbsp; Remove Duplicates
We can check for duplicates using the pandas [.duplicated()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) method. 

We can then delete these rows, if we wish, using [.drop_duplicates()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)
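
A minimal sketch of both methods on an invented frame follows. It also shows the `subset` argument, which restricts the comparison to chosen columns. The values are placeholders, not rows from the real data.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 are identical, rows 2 and 3 share only `long`.
toy = pd.DataFrame({
    "short": ["AAA", "AAA", "BBB", "BBC"],
    "long": ["Apple", "Apple", "Bose", "Bose"],
})

full_dupes = toy.duplicated().sum()               # whole-row duplicates
long_dupes = toy.duplicated(subset="long").sum()  # repeats in `long` only

# drop_duplicates returns a new frame; the original stays unchanged.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(full_dupes, long_dupes, len(deduped))
```

Because `drop_duplicates()` returns a copy, the result has to be assigned back for the change to persist.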

In [5]:
# brands
brands.duplicated().sum()

0

In [49]:
# show the duplicated rows
# brands.loc[brands.duplicated(), :]

There are no fully duplicated rows; the 6 repeated `long` names each have a different `short` code.

In [5]:
# remove duplicates permanently
#brands = brands.drop_duplicates()

## 3.&nbsp; Deal with Missing Values

### 3.1.&nbsp; Brands
* There are no missing values


## 4.&nbsp; Correct Datatypes

### 4.1.&nbsp; Intro
How to convert columns with `pd.to_datetime()` and `pd.to_numeric()`:

In [10]:
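# Template from the other tables: `orders` and `orderlines` are not loaded in this notebook,
# and brands has no date or numeric columns to convert.
orders["created_date"] = pd.to_datetime(orders["created_date"])
orderlines["unit_price"] = pd.to_numeric(orderlines["unit_price"])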

In [13]:
orders.info() # shows the changed datatype
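
Since `orders` and `orderlines` are not loaded here, the conversion can be tried on a small invented frame instead. The sketch below uses the column names from the cell above with made-up values, and adds `errors="coerce"`, which turns unparseable entries into `NaT`/`NaN` instead of raising. Whether coercing is appropriate for the real data is an assumption to check case by case.

```python
import pandas as pd

# Hypothetical values; the last row is deliberately malformed in both columns.
toy_orders = pd.DataFrame({
    "created_date": ["2017-01-02 13:35:40", "2017-03-15 08:00:00", "not a date"],
    "unit_price": ["18.99", "399.00", "1.137.99"],
})

# errors="coerce" replaces values that cannot be parsed with NaT / NaN.
toy_orders["created_date"] = pd.to_datetime(toy_orders["created_date"], errors="coerce")
toy_orders["unit_price"] = pd.to_numeric(toy_orders["unit_price"], errors="coerce")

toy_orders.info()                # shows the changed datatypes
print(toy_orders.isna().sum())   # how many values failed to convert
```

Counting the coerced values afterwards shows how many rows need one of the decisions listed next.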

After some thought and analysis, record the **final decision** on how to proceed:
* [ ] exclude from analysis
* [ ] ask pricing experts to check the products
* [ ] further analyse the impact of these products on overall revenue
* [ ] look at the raw data and try to figure out a solution

## 5. Save File(s)
Now save the cleaned dataframe as a csv file:


In [8]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html?highlight=to_csv

brands.to_csv("brands_cl.csv", index=False)
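
To check that a saved file reads back unchanged, a round trip can be sketched as follows. The frame is an invented stand-in, and the file is written to a temporary directory so the sketch leaves nothing behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned brands dataframe.
toy = pd.DataFrame({"short": ["AAA", "BBB"], "long": ["Apple", "Bose"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "brands_cl.csv")
    # index=False keeps the row index out of the file,
    # so no extra unnamed column appears on reload.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_columns = reloaded.columns.tolist() == toy.columns.tolist()
same_values = reloaded.values.tolist() == toy.values.tolist()
print(same_columns, same_values)
```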

### Download from colab:
#from google.colab import files
#files.download("brands_cl.csv")