# Data Analysis: Generate Insights Like A Pro In 7 Steps
### Step 1: Understand the business problem
### Step 2: Analyze data requirements
### Step 3: Data understanding and collection (Data Gathering)
### Step 4: Data Preparation (Data Transformation)
> Raw data usually contains missing values, inaccuracies, and other errors. Correcting those errors, verifying data quality, and joining the data sets together are therefore a big part of the data preparation process.

#### Data preparation also involves the following steps:

1. Converting the collected data to a structured format with all required elements.
2. Cleaning it to remove unwanted or irrelevant records and values.
3. Data modelling (ERD).
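
The preparation tasks described above (error correction, handling missing values, joining data sets) can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual pipeline: the frames `videos` and `categories`, their columns, and the choice of median imputation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one exact duplicate row, one missing value,
# and one impossible value (a negative view count).
videos = pd.DataFrame({
    "video_id": ["a1", "b2", "b2", "c3"],
    "category_id": [10, 24, 24, 10],
    "views": [1500.0, np.nan, np.nan, -20.0],
})
categories = pd.DataFrame({
    "category_id": [10, 24],
    "title": ["Music", "Entertainment"],
})

# 1. Remove exact duplicate rows.
clean = videos.drop_duplicates().copy()

# 2. Error correction: treat impossible values as missing.
clean.loc[clean["views"] < 0, "views"] = np.nan

# 3. Fill missing values; the median is one simple choice among many.
clean["views"] = clean["views"].fillna(clean["views"].median())

# 4. Join the cleaned data with the lookup table.
prepared = clean.merge(categories, on="category_id", how="left")
print(prepared)
```

Whether to impute, drop, or flag bad values depends on the analysis; the point here is only the order of operations: deduplicate, correct, fill, then join.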

### Step 5: Data visualization
### Step 6: Data analysis
### Step 7: Deployment

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json

# 1. Data Gathering, Transformation & Explanation

### Reading the first file (JSON format) from local disk

In [2]:
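# Open the JSON file; the context manager closes it once reading is done
with open("/home/gaurav/Documents/Python/data/YouTube(IN)/IN_category_id.json") as f:
    # json.load returns the parsed JSON object as a dictionary
    data = json.load(f)

# Create a DataFrame from the raw JSON data (the nested `items` are not flattened yet)
df01 = pd.DataFrame(data)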
df01.head(2)

Unnamed: 0,kind,etag,items
0,youtube#videoCategoryListResponse,kBCr3I9kLHHU79W4Ip5196LDptI,"{'kind': 'youtube#videoCategory', 'etag': 'IfW..."
1,youtube#videoCategoryListResponse,kBCr3I9kLHHU79W4Ip5196LDptI,"{'kind': 'youtube#videoCategory', 'etag': '5XG..."


### Creating a DataFrame from the flattened JSON data

In [66]:
df1 = pd.json_normalize(data,record_path=['items'])
df1.head(2)

Unnamed: 0,kind,etag,id,snippet.title,snippet.assignable,snippet.channelId
0,youtube#videoCategory,IfWa37JGcqZs-jZeAyFGkbeh6bc,1,Film & Animation,True,UCBR8-60-B28hp2BmDPdntcQ
1,youtube#videoCategory,5XGylIs7zkjHh5940dsT5862m1Y,2,Autos & Vehicles,True,UCBR8-60-B28hp2BmDPdntcQ


#### Next, remove the kind, etag, snippet.assignable, and snippet.channelId columns, which are not needed for the analysis.
#### Store the remaining usable data in a category DataFrame and rename its columns.

In [68]:
category = df1.drop(['kind', 'etag', 'snippet.assignable','snippet.channelId'], axis=1)

##### Renaming Columns

In [69]:
category.rename(columns = {'id':'Id','snippet.title':'Title'}, inplace = True)
category.head(2)

Unnamed: 0,Id,Title
0,1,Film & Animation
1,2,Autos & Vehicles


In [56]:
category.shape

(31, 2)

In [57]:
category.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Id      31 non-null     object
 1   Title   31 non-null     object
dtypes: object(2)
memory usage: 624.0+ bytes


In [58]:
category.describe()

Unnamed: 0,Id,Title
count,31,31
unique,31,30
top,1,Comedy
freq,1,2


In [59]:
category.nunique()

Id       31
Title    30
dtype: int64

In [60]:
category['Title'].unique()

array(['Film & Animation', 'Autos & Vehicles', 'Music', 'Pets & Animals',
       'Sports', 'Short Movies', 'Travel & Events', 'Gaming',
       'Videoblogging', 'People & Blogs', 'Comedy', 'Entertainment',
       'News & Politics', 'Howto & Style', 'Education',
       'Science & Technology', 'Movies', 'Anime/Animation',
       'Action/Adventure', 'Classics', 'Documentary', 'Drama', 'Family',
       'Foreign', 'Horror', 'Sci-Fi/Fantasy', 'Thriller', 'Shorts',
       'Shows', 'Trailers'], dtype=object)

In [63]:
category.duplicated().sum()

0
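
The outputs above show 31 unique `Id` values but only 30 unique titles, with `Comedy` appearing twice, yet `duplicated().sum()` returns 0 because it compares whole rows. A column-level check surfaces the repeated title. The sketch below uses a small stand-in frame, `category_sample`, whose `Id` values are assumed for illustration rather than read from the file.

```python
import pandas as pd

# Illustrative subset of the category table (Id values are assumptions).
category_sample = pd.DataFrame({
    "Id": ["22", "23", "24", "34"],
    "Title": ["People & Blogs", "Comedy", "Entertainment", "Comedy"],
})

# Whole-row check: every (Id, Title) pair is distinct, so nothing is flagged.
full_row_dupes = category_sample.duplicated().sum()

# Column-level check: keep=False marks every row that shares its Title
# with at least one other row.
repeated_titles = category_sample[category_sample["Title"].duplicated(keep=False)]

print(full_row_dupes)
print(repeated_titles)
```

Applied to the real `category` frame, the same filter would list the rows that share a title, which matters later if titles rather than ids are used for grouping.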

### Reading the second file (CSV format) from local disk

In [64]:
df2 = pd.read_csv("/home/gaurav/Documents/Python/data/YouTube(IN)/IN_youtube_trending_data.csv") 

In [65]:
df2.head(2)

Unnamed: 0,video_id,title,publishedAt,channelId,channelTitle,categoryId,trending_date,tags,view_count,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,description
0,Iot0eF6EoNA,Sadak 2 | Official Trailer | Sanjay | Pooja | ...,2020-08-12T04:31:41Z,UCGqvJPRcv7aVFun-eTsatcA,FoxStarHindi,24,2020-08-12T00:00:00Z,sadak|sadak 2|mahesh bhatt|vishesh films|pooja...,9885899,224925,3979409,350210,https://i.ytimg.com/vi/Iot0eF6EoNA/default.jpg,False,False,Three Streams. Three Stories. One Journey. Sta...
1,x-KbnJ9fvJc,Kya Baat Aa : Karan Aujla (Official Video) Tan...,2020-08-11T09:00:11Z,UCm9SZAl03Rev9sFwloCdz1g,Rehaan Records,10,2020-08-12T00:00:00Z,[None],11308046,655450,33242,405146,https://i.ytimg.com/vi/x-KbnJ9fvJc/default.jpg,False,False,Singer/Lyrics: Karan Aujla Feat Tania Music/ D...


In [None]:
# df01['items'].values.tolist()

In [None]:
df2.info()

In [None]:
df2['publishedAt'].min()

In [None]:
df2['publishedAt'].max()
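
`publishedAt` is read from the CSV as ISO 8601 text. `min()` and `max()` on uniformly formatted strings still give the right order, but parsing the column into datetimes makes date arithmetic possible. A minimal sketch, using the two `publishedAt` values from the `head(2)` output above in a stand-in frame called `sample`:

```python
import pandas as pd

# Stand-in for df2 with the two timestamps shown in the head(2) output.
sample = pd.DataFrame({
    "publishedAt": ["2020-08-12T04:31:41Z", "2020-08-11T09:00:11Z"],
})

# Parse the ISO 8601 strings into timezone-aware (UTC) datetimes.
sample["publishedAt"] = pd.to_datetime(sample["publishedAt"], utc=True)

oldest = sample["publishedAt"].min()
newest = sample["publishedAt"].max()

# With real datetimes, the covered time span is a simple subtraction.
span = newest - oldest
print(oldest, newest, span)
```

The same conversion could be applied to `trending_date`, which uses the same format in the sample rows.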

In [None]:
df2.describe()

In [None]:
df2["categoryId"].nunique()

In [None]:
df2["categoryId"].unique()

In [None]:
df2.shape
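
Joining the two data sets, as called for in Step 4, needs matching key types: `category.info()` reports `Id` as `object` (strings), while `categoryId` in the CSV appears to be numeric. The sketch below shows one way to align and merge them. The stand-in frames and the assumption that `categoryId` is parsed as an integer column are illustrative, not confirmed by the notebook output.

```python
import pandas as pd

# Stand-ins for `category` and `df2`; only the columns needed for the join.
category_sample = pd.DataFrame({
    "Id": ["10", "24"],  # strings, as in category.info()
    "Title": ["Music", "Entertainment"],
})
trending_sample = pd.DataFrame({
    "video_id": ["Iot0eF6EoNA", "x-KbnJ9fvJc"],
    "categoryId": [24, 10],  # assumed to be parsed as integers by read_csv
})

# Align the key dtypes before merging; mismatched dtypes would not match.
category_sample["Id"] = category_sample["Id"].astype("int64")

# Left join keeps every trending row; validate guards against duplicate Ids.
merged = trending_sample.merge(
    category_sample,
    left_on="categoryId",
    right_on="Id",
    how="left",
    validate="many_to_one",
).drop(columns="Id")

print(merged)
```

`validate="many_to_one"` raises an error if an `Id` occurs more than once in the lookup table, which is a cheap safeguard against silently multiplying rows.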

In [None]:
import pandas as pd

data1 = [
	{"Roll no": 1,
	"student": {"first_name": "Ram", "last_name": "kumar"}
	},
	{"student": {"English": "95", "Math": "88"}
	},
	{"Roll no": 2,
	"student": {"first_name": "Joseph", "English": "90", "Science": "82"}
	},
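	{"Roll no": 3,
	# Note: the key "student" appears twice in this dict. Python keeps only the
	# last value, so the first_name/last_name entry is silently discarded.
	"student": {"first_name": "abinaya", "last_name": "devi"},
	"student": {"English": "91", "Math": "98"}
	},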
]

pd.DataFrame(data1)

In [None]:
pd.json_normalize(data1)

In [None]:
pd.json_normalize(data1, max_level=1)
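
Because `data1` is nested only one level deep, `max_level=1` produces the same columns as the default call. The effect of `max_level` becomes visible with deeper nesting. The record below is a hypothetical shape invented for this comparison; it is not the structure of the YouTube file.

```python
import pandas as pd

# Hypothetical record with two levels of nesting under "snippet".
record = [
    {"id": 1, "snippet": {"title": "Film & Animation", "meta": {"assignable": True}}}
]

flat_all = pd.json_normalize(record)                # flatten every level
flat_one = pd.json_normalize(record, max_level=1)   # flatten one level only
flat_none = pd.json_normalize(record, max_level=0)  # leave nested dicts intact

print(list(flat_all.columns))
print(list(flat_one.columns))
print(list(flat_none.columns))
```

With `max_level=1` the inner `meta` dictionary stays in a single `snippet.meta` column, and with `max_level=0` the whole `snippet` dictionary stays in one column.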

In [None]:
import pandas as pd
data2 = [
	{
		"company": "Google",
		"tagline": "Don't be evil",
		"management": {"CEO": "Sundar Pichai"},
		"department": [
			{"name": "Gmail", "revenue (bn)": 123},
			{"name": "GCP", "revenue (bn)": 400},
			{"name": "Google drive", "revenue (bn)": 600},
		],
	},
	{
		"company": "Microsoft",
		"tagline": "Be What's Next",
		"management": {"CEO": "Satya Nadella"},
		"department": [
			{"name": "Onedrive", "revenue (bn)": 13},
			{"name": "Azure", "revenue (bn)": 300},
			{"name": "Microsoft 365", "revenue (bn)": 300},
		],
	},

]
result = pd.json_normalize(
	data2, "department", ["company", "tagline", ["management", "CEO"]]
)
result

In [None]:
pd.DataFrame(data2)

In [None]:
test = {
    "kind": "youtube#videoCategoryListResponse",
    "etag": "kBCr3I9kLHHU79W4Ip5196LDptI",
    "items": [
        {
            "kind": "youtube#videoCategory",
            "etag": "IfWa37JGcqZs-jZeAyFGkbeh6bc",
            "id": "1",
            "snippet": {
                "title": "Film & Animation",
                "assignable": True,
                "channelId": "UCBR8-60-B28hp2BmDPdntcQ"
            }
        }
    ]
}

test_df = pd.DataFrame(test)
test_df

In [None]:
pd.json_normalize(test)

In [None]:
# Flatten items
pd.json_normalize(test, record_path=['items'])
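
Flattening with `record_path` alone drops the top-level `kind` and `etag` of the response. They can be carried along with `meta`, but here each item has its own `kind` and `etag`, so the names collide and a prefix is needed. A minimal sketch on the same structure; the prefix `response.` is an arbitrary choice made for this example.

```python
import pandas as pd

test = {
    "kind": "youtube#videoCategoryListResponse",
    "etag": "kBCr3I9kLHHU79W4Ip5196LDptI",
    "items": [
        {
            "kind": "youtube#videoCategory",
            "etag": "IfWa37JGcqZs-jZeAyFGkbeh6bc",
            "id": "1",
            "snippet": {
                "title": "Film & Animation",
                "assignable": True,
                "channelId": "UCBR8-60-B28hp2BmDPdntcQ",
            },
        }
    ],
}

# Without a prefix, the top-level names clash with the item-level names.
try:
    pd.json_normalize(test, record_path=["items"], meta=["kind", "etag"])
    conflict = False
except ValueError:
    conflict = True

# meta_prefix renames the top-level fields so both levels can coexist.
flat = pd.json_normalize(
    test,
    record_path=["items"],
    meta=["kind", "etag"],
    meta_prefix="response.",
)
print(conflict)
print(list(flat.columns))
```

For this notebook the response-level fields are dropped anyway, so plain `record_path=['items']` is sufficient; the prefix only matters when those fields are worth keeping.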