In [1]:
%load_ext sql

In [2]:
%sql postgresql://postgres:Phuong*011195@localhost:5432/postgres

## Step 3: Process - Make it usable!
<p>When we start using the data, it may combine several sources or be of uneven quality. Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
<p>The database provided to us contains five tables. Two of them, <code>info</code> and <code>brands</code>, have fields that need cleaning before the next step: <code>Analyze</code>.

<p>We start with the first table, <code>info</code>:
<h3 id="info"><code>info</code></h3>
<table>
<thead>
<tr>
<th>column</th>
<th>data type</th>
<th>description</th>
<th>criteria</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>product_name</code></td>
<td><code>varchar</code></td>
<td>Name of the product</td>
<td>Nominal. All product names should be uppercased.</td>
</tr>
<tr>
<td><code>product_id</code></td>
<td><code>varchar</code></td>
<td>Unique ID for product</td>
<td>Nominal. The unique identifier of the product. Missing values are not possible due to the database structure.</td>
</tr>
<tr>
<td><code>description</code></td>
<td><code>varchar</code></td>
<td>Description of the product</td>
<td>Nominal. Keep the same format as original file.</td>
</tr>
</tbody>
</table>

<p>To meet the criteria for the <code>product_name</code> column, we define a temporary result set with a <code>CTE</code> (Common Table Expression) and convert the names to uppercase with the <code>UPPER()</code> function:</p>

In [3]:
%%sql
WITH clean_info AS
    (SELECT
        UPPER(product_name) AS product_name,
        product_id,
        description
     FROM info)
SELECT *
FROM clean_info
LIMIT 5;

 * postgresql://postgres:***@localhost:5432/postgres
5 rows affected.


product_name,product_id,description
,AH2430,
WOMEN'S ADIDAS ORIGINALS SLEEK SHOES,G27341,"A modern take on adidas sport heritage, tailored just for women. Perforated 3-Stripes on the leather upper of these shoes offer a sleek look that mirrors iconic tennis styles."
WOMEN'S ADIDAS SWIM PUKA SLIPPERS,CM0081,These adidas Puka slippers for women's come with slim straps for a great fit. Feature performance logo on the footbed and textured Rubber outsole that gives unique comfort.
WOMEN'S ADIDAS SPORT INSPIRED QUESTAR RIDE SHOES,B44832,"Inspired by modern tech runners, these women's shoes step out with unexpected style. They're built with a breathable knit upper, while the heel offers the extra support of an Achilles-hugging design. The cushioned midsole provides a soft landing with every stride."
WOMEN'S ADIDAS ORIGINALS TAEKWONDO SHOES,D98205,"This design is inspired by vintage Taekwondo styles originally worn to perfect high kicks and rapid foot strikes. The canvas shoes make a streetwear fashion statement as a chic, foot-hugging slip-on. They're shaped for a narrow, women's-specific fit and ride on a soft gum rubber outsole."
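
<p>A quick sanity check can confirm that no mixed-case names survive the cleaning step. The sketch below is only an illustration: it uses Python's built-in <code>sqlite3</code> module with an in-memory table as a stand-in for the PostgreSQL <code>info</code> table, and the three rows (ids <code>X0001</code> to <code>X0003</code>) are made up. The same <code>WHERE product_name &lt;&gt; UPPER(product_name)</code> filter should carry over to a <code>%%sql</code> cell.

```python
import sqlite3

# In-memory SQLite stand-in for the notebook's PostgreSQL `info` table.
# The rows below are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE info (product_name TEXT, product_id TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO info VALUES (?, ?, ?)",
    [
        (None, "X0001", None),
        ("Women's Sample Shoes", "X0002", "Sample description."),
        ("MEN'S SAMPLE SLIPPERS", "X0003", "Another sample description."),
    ],
)

# Count names that differ from their uppercase form in the raw table.
not_upper_raw = conn.execute(
    "SELECT COUNT(*) FROM info WHERE product_name <> UPPER(product_name)"
).fetchone()[0]

# Run the same check against the cleaned CTE.
check_sql = """
WITH clean_info AS (
    SELECT UPPER(product_name) AS product_name, product_id, description
    FROM info
)
SELECT COUNT(*)
FROM clean_info
WHERE product_name <> UPPER(product_name);
"""
not_upper_clean = conn.execute(check_sql).fetchone()[0]

print(not_upper_raw, not_upper_clean)
```

<p>Rows with a null <code>product_name</code> are not counted by this filter, because a comparison with null is neither true nor false.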


<p>The <code>info</code> table now matches its criteria, so let's move on to the next table: <code>brands</code>. This table contains 2 columns, each with its own criteria:
<h3 id="brands"><code>brands</code></h3>
<table>
<thead>
<tr>
<th>column</th>
<th>data type</th>
<th>description</th>
<th>criteria</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>product_id</code></td>
<td><code>varchar</code></td>
<td>Unique ID for product</td>
<td>Nominal. The unique identifier of the product. Missing values are not accepted.</td>
</tr>
<tr>
<td><code>brand</code></td>
<td><code>varchar</code></td>
<td>Brand of the product</td>
<td>Nominal. Name of the product's brand. One of three possible values: 'Adidas', 'Nike', or null.</td>
</tr>
</tbody>
</table>
<p>Now we check whether the <code>product_id</code> column is unique by using <code>COUNT()</code> with a <code>HAVING</code> clause:

In [6]:
%%sql
SELECT
    product_id,
    COUNT(product_id) AS count_id
FROM brands
GROUP BY product_id
HAVING COUNT(product_id) > 1;

 * postgresql://postgres:***@localhost:5432/postgres
1 rows affected.


product_id,count_id
CJ9585-600,2
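
<p>Before removing duplicates, it can help to see whether the repeated rows are identical or differ in <code>brand</code>, since that decides which fix will work. The sketch below is a hedged illustration only: it uses an in-memory SQLite table in place of the PostgreSQL <code>brands</code> table, and the sample ids and brand spellings are made up.

```python
import sqlite3

# In-memory SQLite stand-in for the `brands` table; rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE brands (product_id TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO brands VALUES (?, ?)",
    [
        ("X0001", "Adidas"),
        ("X0002", "Nike"),
        ("X0002", "Nike"),   # exact duplicate row
        ("X0003", "Adi"),
        ("X0003", "adida"),  # same id, different spelling
    ],
)

# For each repeated id, compare the row count with the number of
# distinct brand values attached to it.
inspect_sql = """
SELECT
    product_id,
    COUNT(*) AS row_count,
    COUNT(DISTINCT brand) AS distinct_brands
FROM brands
GROUP BY product_id
HAVING COUNT(*) > 1
ORDER BY product_id;
"""
rows = conn.execute(inspect_sql).fetchall()
for row in rows:
    print(row)
```

<p>If <code>distinct_brands</code> is 1, the repeated rows carry the same brand value and <code>SELECT DISTINCT</code> collapses them. A value above 1 means the rows differ, so they would survive <code>SELECT DISTINCT *</code>.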


<p>Oops! One <code>product_id</code> appears twice, which means the column is not unique as the criteria require. To address this, we use a <code>SELECT DISTINCT</code> subquery in the <code>FROM</code> clause, which removes rows that are exact duplicates across all columns.
<p>We are not finished yet. We also use the <code>ILIKE</code> operator, which matches a pattern regardless of letter case. In this table, 'Adi' and 'adida' are misspellings that need to be replaced with 'Adidas', and 'N' becomes 'Nike'. Any value that matches neither pattern becomes null, because the <code>CASE</code> expression has no <code>ELSE</code> branch:

In [7]:
%%sql
WITH clean_brands AS
    (SELECT
        product_id,
        CASE
            WHEN brand ILIKE 'Adi%' THEN 'Adidas'
            WHEN brand ILIKE 'N%' THEN 'Nike'
        END AS brand
    FROM
        (SELECT DISTINCT *
         FROM brands) AS distinct_brands)

SELECT DISTINCT brand
FROM clean_brands;


 * postgresql://postgres:***@localhost:5432/postgres
3 rows affected.


brand
""
Adidas
Nike


<p>Great! The <code>brand</code> column now contains only the 3 values its criteria allow. The <code>SELECT DISTINCT</code> subquery in the <code>FROM</code> clause removes exact duplicate rows, so <code>product_id</code> is unique as long as the repeated rows were identical in every column.
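
<p>Whether <code>product_id</code> really ends up unique can be checked by re-running the duplicate test on the cleaned result. The sketch below is an illustration under stated assumptions, not the notebook's actual data: it uses an in-memory SQLite table with made-up rows, and because SQLite has no <code>ILIKE</code>, it substitutes <code>LOWER(brand) LIKE ...</code>. The helper <code>duplicated_ids</code> is a hypothetical name introduced here. It compares deduplicating before the brand fix with deduplicating after it.

```python
import sqlite3

# In-memory SQLite stand-in for the `brands` table; rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE brands (product_id TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO brands VALUES (?, ?)",
    [
        ("X0001", "Adidas"),
        ("X0002", "Nike"),
        ("X0002", "Nike"),
        ("X0003", "Adi"),
        ("X0003", "adida"),
        ("X0004", None),
    ],
)

# SQLite has no ILIKE, so LOWER(...) LIKE is used as a stand-in here.
normalize = """
    CASE
        WHEN LOWER(brand) LIKE 'adi%' THEN 'Adidas'
        WHEN LOWER(brand) LIKE 'n%' THEN 'Nike'
    END
"""

# Variant 1: drop exact duplicate rows first, then fix the brand names.
dedupe_first = f"""
SELECT product_id, {normalize} AS brand
FROM (SELECT DISTINCT * FROM brands) AS deduped
"""

# Variant 2: fix the brand names first, then drop duplicate rows.
normalize_first = f"""
SELECT DISTINCT product_id, {normalize} AS brand
FROM brands
"""

def duplicated_ids(cleaned_query):
    """Return the product_ids that still appear more than once."""
    sql = f"""
    SELECT product_id
    FROM ({cleaned_query}) AS cleaned
    GROUP BY product_id
    HAVING COUNT(*) > 1
    ORDER BY product_id
    """
    return [row[0] for row in conn.execute(sql)]

dups_dedupe_first = duplicated_ids(dedupe_first)
dups_normalize_first = duplicated_ids(normalize_first)
print(dups_dedupe_first, dups_normalize_first)
```

<p>With these sample rows, the first variant still reports <code>X0003</code>, because 'Adi' and 'adida' only become identical after the <code>CASE</code> expression runs. Neither variant helps if one id maps to two different brands, so an empty result from this check is the real confirmation.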
<p>All incorrect values have been converted to the correct form. Our data set is now ready for the next step: <code>Analyze</code>.

## Step 4: Analyze - Tell me the story!