# **<font color='#0969DA'>Guided Lab 343.3.4 - Exploratory Data Analysis on json data - Basic insights from the Data</font>**
---

## **Lab Overview:**

This lab focuses on performing Exploratory Data Analysis (EDA) on a JSON dataset using Python and the Pandas library. The lab aims to guide you through the following key concepts:

1. **Data Type Inspection:** Understanding the importance of checking data types for potential mismatches and compatibility with Python methods. This is demonstrated using the `dtypes` attribute of Pandas DataFrames.

2. **Descriptive Statistics:** Calculating and interpreting basic statistical measures such as mean, standard deviation, minimum, maximum, and quartiles using the `describe()` method.

3. **Concise Summary:** Obtaining a comprehensive overview of the dataset, including column names, data types, memory usage, and non-null values, using the `info()` method.

4. **Data Selection:** Extracting specific records or subsets of the data using the `head()` and `tail()` methods and the `at` and `iat` accessors, enabling efficient exploration of large datasets.

5. **Data Shape and Size:** Determining the number of rows and columns using the `shape` attribute and exploring alternative methods like `axes` and `len` to access this information.

**Learning Outcomes:**

By the end of this lab, you should be able to:

* Confidently load and manipulate JSON data in Python using Pandas.
* Utilize various Pandas functions to perform basic EDA tasks.
* Interpret descriptive statistics and summaries to gain insights into data.
* Efficiently extract and analyze specific subsets of data.
* Understand the structure and dimensions of a dataset.

**Dataset:**

The lab utilizes a JSON dataset named ['cars.json'](https://drive.google.com/file/d/1CXAK8gbuLtc2NNOXVUgmja8fDg0TrNZm/view) as the primary data source for analysis and demonstration.


### **<font color='#0969DA'>How to Check Data Types in Pandas</font>**



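- In pandas, the **dtypes** attribute reports the data type of every column.
- Why check data types?
 - To catch mismatches between what a column holds and how it was read, for example numbers stored as text.
 - To confirm that a column is compatible with the Python and pandas methods you plan to call on it.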
---
# **Begin**

The lab follows a step-by-step approach, starting with loading the JSON data into a Pandas DataFrame. It then proceeds with exploring the data's characteristics, calculating statistics, selecting specific records, and understanding the dataset's structure.

In [1]:
import pandas as pd

In [2]:
# Read JSON file
df_cars = pd.read_json('./Data/cars.json')
print(df_cars.dtypes) # check the underlying data types

Car              object
MPG             float64
Cylinders         int64
Displacement    float64
Horsepower        int64
Weight            int64
Acceleration    float64
Model             int64
Origin           object
quantity          int64
city             object
dtype: object
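
To illustrate why the type check matters, here is a minimal sketch that loads a small, made-up JSON string (not cars.json) in which `Horsepower` arrives as text. The records, the name `df_demo`, and the value `"unknown"` are assumptions chosen for the example.

```python
import io
import pandas as pd

# Hypothetical records, not taken from cars.json: Horsepower is stored as text
json_text = (
    '[{"Car": "A", "Horsepower": "75"},'
    ' {"Car": "B", "Horsepower": "72"},'
    ' {"Car": "C", "Horsepower": "unknown"}]'
)
df_demo = pd.read_json(io.StringIO(json_text))
print(df_demo.dtypes)  # Horsepower is not numeric yet

# Convert to numbers; values that cannot be parsed become NaN
df_demo["Horsepower"] = pd.to_numeric(df_demo["Horsepower"], errors="coerce")
print(df_demo.dtypes)                # Horsepower is now float64
print(df_demo["Horsepower"].mean())  # mean() skips the NaN
```

Checking `dtypes` first tells you whether a conversion like this is needed before you compute statistics on a column.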


## **<font color='#0969DA'>Determining Descriptive Statistics</font>**

- Pandas provides many statistical methods for DataFrames. The **describe()** method returns a basic statistical summary of the numerical columns of a DataFrame.

See the pandas reference for all descriptive-statistics methods:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats

- Example: Consider the **cars.json** dataset

In [6]:

df_cars = pd.read_json('./Data/cars.json')
df_cars.describe()

| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | quantity |
|---|---|---|---|---|---|---|---|---|
| count | 161.0 | 161.0 | 161.0 | 161.0 | 161.0 | 161.0 | 161.0 | 161.0 |
| mean | 23.801863 | 5.347826 | 185.232919 | 100.664596 | 2915.093168 | 15.509938 | 76.26087 | 224.875776 |
| std | 8.810125 | 1.761607 | 105.394809 | 41.07964 | 890.293883 | 2.51578 | 3.818576 | 127.741084 |
| min | 0.0 | 3.0 | 68.0 | 0.0 | 1613.0 | 8.0 | 70.0 | 5.0 |
| 25% | 17.0 | 4.0 | 98.0 | 72.0 | 2130.0 | 14.0 | 73.0 | 112.0 |
| 50% | 24.0 | 4.0 | 140.0 | 88.0 | 2625.0 | 15.5 | 76.0 | 227.0 |
| 75% | 31.0 | 8.0 | 302.0 | 130.0 | 3620.0 | 17.1 | 80.0 | 337.0 |
| max | 46.6 | 8.0 | 440.0 | 215.0 | 4955.0 | 22.1 | 82.0 | 439.0 |


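In the result above, describe() returns a new DataFrame with one column for each numeric column of df_cars. The count row gives the number of non-null values in each column (161 here), followed by the mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum. The minimum of 0.0 for MPG and Horsepower is implausible for a car and may indicate missing values recorded as zero.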
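
Because describe() returns an ordinary DataFrame, individual statistics can be read with `loc`. The sketch below uses made-up values (not cars.json) and also counts zeros per column, since the output above shows a minimum of 0.0 for MPG and Horsepower. Treating a zero as a possible missing value is an assumption, not something the dataset confirms.

```python
import pandas as pd

# Hypothetical values, not taken from cars.json
df_demo = pd.DataFrame({
    "MPG": [25.0, 22.0, 0.0, 28.0],
    "Horsepower": [75, 72, 90, 0],
})

summary = df_demo.describe()        # the summary is itself a DataFrame
print(summary.loc["mean", "MPG"])   # one statistic for one column
print(summary.loc[["min", "max"]])  # several statistics at once

# Count zeros per column; a zero may be a placeholder for a missing value
zero_counts = (df_demo == 0).sum()
print(zero_counts)
```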

---



## **<font color='#0969DA'>Determine Basic Concise summary</font>**

Pandas also provides the **info()** method, which prints a concise summary of a DataFrame.

In other words, info() reports metadata about the pandas DataFrame, including:

- Number of rows and the index range
- Total number of columns
- List of columns
- Count of the total number of non-null values in the column
- Data type of column
- Count of columns in each data type
- Memory usage by the DataFrame

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

# **<font color='#0969DA'>DataFrame Count</font>**

**df.count():**
count() returns the number of non-NA values in each column. As a way to count rows it is less convenient than the alternatives, because 1) it's slower and 2) it returns one value per column, so you need extra work after you call .count() to get a single row count.

Be careful if you have NAs in your dataset: count() skips them by default, so a column's count can be smaller than the number of rows.
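
A minimal sketch of that difference, using a made-up frame with one missing value (the data and the name `df_demo` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical values; None marks a missing MPG reading
df_demo = pd.DataFrame({
    "Car": ["A", "B", "C"],
    "MPG": [25.0, None, 28.0],
})

non_na_counts = df_demo.count()  # one count per column, NA values skipped
print(non_na_counts)
print(len(df_demo))              # number of rows, NA values included
```

Here `MPG` has a count of 2 although the frame has 3 rows, which is why count() on its own is not a reliable row count.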

In [7]:
df_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Car           161 non-null    object 
 1   MPG           161 non-null    float64
 2   Cylinders     161 non-null    int64  
 3   Displacement  161 non-null    float64
 4   Horsepower    161 non-null    int64  
 5   Weight        161 non-null    int64  
 6   Acceleration  161 non-null    float64
 7   Model         161 non-null    int64  
 8   Origin        161 non-null    object 
 9   quantity      161 non-null    int64  
 10  city          161 non-null    object 
dtypes: float64(3), int64(5), object(3)
memory usage: 14.0+ KB


In the result above, the summary shows the index range, the number of columns, each column's label, non-null count, and data type, the number of columns per data type, and the memory usage.
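
The `+` in `memory usage: 14.0+ KB` marks the figure as a lower bound: by default pandas does not measure the contents of object columns. The sketch below, on a few sample rows, assumes you want the fuller figure and requests it with `memory_usage="deep"`. The name `df_demo` is hypothetical.

```python
import pandas as pd

# Sample rows for illustration; dtype=object keeps the strings as Python objects
df_demo = pd.DataFrame({
    "Car": pd.Series(["Chevrolet Vega", "Datsun 310", "Mercury Lynx l"], dtype=object),
    "MPG": [25.0, 37.2, 36.0],
})

df_demo.info(memory_usage="deep")  # measures the string contents as well

shallow_bytes = df_demo.memory_usage().sum()        # default estimate
deep_bytes = df_demo.memory_usage(deep=True).sum()  # includes string contents
print(shallow_bytes, deep_bytes)
```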

---



## **<font color='#0969DA'>Select few records</font>**

The **head()** and **tail()** methods select the top and bottom rows of a pandas DataFrame, respectively. They are useful with large datasets, where it is not practical to view the entire dataset at once.

**Example: Consider the cars.json dataset**

With **head(2)**, only the first 2 rows of the DataFrame are displayed.

In [None]:
print(df_cars.head(2))
print(df_cars.head(-4)) # All rows minus the last 4

                   Car   MPG  Cylinders  Displacement  Horsepower  Weight  \
0       Chevrolet Vega  25.0          4         140.0          75    2542   
1  Chevrolet Vega (sw)  22.0          4         140.0          72    2408   

   Acceleration  Model Origin  quantity    city  
0          17.0     74     US       177      NJ  
1          19.0     71     US        91  DALLAS  
                            Car   MPG  Cylinders  Displacement  Horsepower  \
0                Chevrolet Vega  25.0          4         140.0          75   
1           Chevrolet Vega (sw)  22.0          4         140.0          72   
2           Chevrolet Vega 2300  28.0          4         140.0          90   
3               Chevrolet Woody  24.5          4          98.0          60   
4    Chevrolete Chevelle Malibu  16.0          6         250.0         105   
..                          ...   ...        ...           ...         ...   
152          Mercedes Benz 300d  25.4          5         183.0          



---



With **tail(2)**, only the last 2 rows of the DataFrame are displayed.

In [None]:
print(df_cars.tail(2))
print(df_cars.tail(-20)) # All rows except the first 20

                 Car   MPG  Cylinders  Displacement  Horsepower  Weight  \
159   Mercury Lynx l  36.0          4          98.0          70    2125   
160  Mercury Marquis  11.0          8         429.0         208    4633   

     Acceleration  Model Origin  quantity   city  
159          17.3     82     US       425  TEXAS  
160          11.0     72     US       112     OH  
                         Car   MPG  Cylinders  Displacement  Horsepower  \
20             Datsun 280-ZX  32.7          6         168.0         132   
21                Datsun 310  37.2          4          86.0          65   
22             Datsun 310 GX  38.0          4          91.0          67   
23                Datsun 510  27.2          4         119.0          97   
24           Datsun 510 (sw)  28.0          4          97.0          92   
..                       ...   ...        ...           ...         ...   
156         Mercury Capri v6  21.0          6         155.0         107   
157  Mercury Cougar B
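
The negative arguments used above can be easier to follow on a tiny frame. This sketch uses a hypothetical 10-row DataFrame (not cars.json) and shows that `head(-n)` and `tail(-n)` match the corresponding positional slices.

```python
import pandas as pd

# Hypothetical frame with 10 rows labeled 0..9
df_demo = pd.DataFrame({"value": range(10)})

first_part = df_demo.head(-4)  # everything except the last 4 rows
last_part = df_demo.tail(-3)   # everything except the first 3 rows

print(first_part.index.tolist())
print(last_part.index.tolist())

# The same selections written as positional slices
print(first_part.equals(df_demo.iloc[:-4]))
print(last_part.equals(df_demo.iloc[3:]))
```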



---



## **<font color='#0969DA'>Select Specific records</font>**

The **at** and **iat** properties access a single element of the DataFrame: **at** selects by row and column label, and **iat** selects by integer position.

Example: Using **at** property:

**Consider the cars.json dataset**



In [15]:
print(df_cars.at[157, 'MPG'])
print(df_cars.at[20, 'MPG'])


15.0
32.7


**DataFrame.iat:** Suppose we want to access a specific element of a very large DataFrame but know only its position, not its row or column label. The **iat** property of a pandas DataFrame selects an element by integer row and column position.

**Example: Using iat property:**
In this example, we access the element at row position 157 and column position 1. Positions are zero-based, so column position 1 is the second column, MPG.

In [16]:
df_cars.iat[157, 1]

np.float64(15.0)
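
In df_cars the row labels happen to equal the row positions, so `at[157, 'MPG']` and `iat[157, 1]` look alike. The sketch below uses a hypothetical frame with string row labels (an assumption for illustration) to make the difference between labels and positions visible.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ
df_demo = pd.DataFrame(
    {"Car": ["A", "B", "C"], "MPG": [25.0, 22.0, 28.0]},
    index=["r1", "r2", "r3"],
)

by_label = df_demo.at["r2", "MPG"]  # row label, column label
by_position = df_demo.iat[1, 1]     # row position, column position (zero-based)
print(by_label, by_position)

# Translate a column label into its position when you need iat
mpg_position = df_demo.columns.get_loc("MPG")
print(df_demo.iat[1, mpg_position])

# Both accessors can also assign a single value
df_demo.at["r3", "MPG"] = 30.0
print(df_demo.iat[2, 1])
```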



---



# **<font color='#0969DA'>DataFrame Shape</font>**
## **Find number of rows and columns**
The number of rows and columns of a DataFrame can be identified using the **.shape** attribute of the pandas DataFrame. It returns a tuple (rows, columns), which can be indexed to get only the row count or only the column count.


**- df.shape[0] - To count rows**

**- df.shape[1] - To count columns**

In [17]:

print(df_cars.shape) # Get the number of rows and columns
print(df_cars.shape[0]) # Get the number of rows only
print(df_cars.shape[1]) # Get the number of columns only

(161, 11)
161
11
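
Since `shape` is a plain tuple, it can also be unpacked in a single step. A minimal sketch on a hypothetical three-row frame, which also shows the related `size` attribute (the total number of cells):

```python
import pandas as pd

# Hypothetical values, not taken from cars.json
df_demo = pd.DataFrame({
    "Car": ["A", "B", "C"],
    "MPG": [25.0, 22.0, 28.0],
})

n_rows, n_cols = df_demo.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)
print(df_demo.size)             # total number of cells: rows * columns
```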


The column labels themselves are available through the `columns` attribute, shown here on a small DataFrame built from a dictionary.

In [3]:
# Create DataFrame from dict
student_dict = {'Name': ['Joe', 'Nat', 'Harry'], 'Age': [20, 21, 19], 'Marks': [85.10, 77.80, 91.54]}

student_df = pd.DataFrame(student_dict)

column_index = student_df.columns    # Index object holding the column labels
print(column_index)
first_label = student_df.columns[0]  # label of the first column
print(first_label)
print(student_df.columns[0])
column_list = student_df.columns.tolist()  # column labels as a plain list
print(column_list)

Index(['Name', 'Age', 'Marks'], dtype='object')
Name
Name
['Name', 'Age', 'Marks']




---

# **<font color='#0969DA'>DataFrame Axes Length</font>**

**len(df.axes[0]):** Next up is our most verbose option, DataFrame Axes Length.

The axes attribute returns both axes of the DataFrame; you select the row axis and then count its length.

Let's break this one down. **df.axes** returns a list of your two axes, rows first and then columns. [0] pulls the first item (the row axis) from the list. Finally, **len()** finds the length of that axis, or how many items it has, which is your row count.

 Let's look through it step by step.

- Return both axes (rows/columns)

- Pull out the rows

- Count the length

In [19]:
df_cars.axes

[RangeIndex(start=0, stop=161, step=1),
 Index(['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
        'Acceleration', 'Model', 'Origin', 'quantity', 'city'],
       dtype='object')]

In [20]:
df_cars.axes[0]

RangeIndex(start=0, stop=161, step=1)

In [6]:
df_cars.axes[1]

Index(['Car', 'MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
       'Acceleration', 'Model', 'Origin', 'quantity', 'city'],
      dtype='object')

In [11]:
df_cars.axes[1][1]

'MPG'

In [21]:
len(df_cars.axes[0])

161

In [4]:
len(df_cars.axes[1])

11
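
The row-counting and column-counting options covered in this lab should all agree. The sketch below checks that on a hypothetical five-row frame (the data and names are assumptions for illustration).

```python
import pandas as pd

# Hypothetical frame with 5 rows and 2 columns
df_demo = pd.DataFrame({
    "value": range(5),
    "flag": [True, False, True, False, True],
})

# Four ways to get the row count
row_counts = {
    "shape": df_demo.shape[0],
    "len": len(df_demo),
    "index": len(df_demo.index),
    "axes": len(df_demo.axes[0]),
}
print(row_counts)

# Three ways to get the column count
col_counts = {
    "shape": df_demo.shape[1],
    "columns": len(df_demo.columns),
    "axes": len(df_demo.axes[1]),
}
print(col_counts)
```

In practice `shape` or `len(df)` is usually the shortest to write; the `axes` form is mainly useful for seeing how the two axes are stored.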

## **Submission**
- Submit your completed lab using the Start Assignment button on the assignment page in Canvas.
- Your submission can include:
  - If you are using a notebook, all tasks should be written and submitted in a single notebook file, for example: (**your_name_labname.ipynb**).
  - If you are using a Python script, all tasks should be written and submitted in a single Python script file, for example: **(your_name_labname.py)**.
- Add appropriate comments and any additional instructions if required.
