### Data taken from Kaggle
Source - https://www.kaggle.com/ashishjangra27/geeksforgeeks-articles

Importing required modules

In [158]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from collections import Counter

Importing the data and making a copy of it

In [2]:
original_df=pd.read_csv('articles.csv')
#Make a copy of original data
df=original_df.copy()

In [3]:
df.head()

Unnamed: 0,title,author_id,last_updated,link,category
0,5 Best Practices For Writing SQL Joins,priyankab14,"21 Feb, 2022",https://www.geeksforgeeks.org/5-best-practices...,easy
1,Foundation CSS Dropdown Menu,ishankhandelwals,"20 Feb, 2022",https://www.geeksforgeeks.org/foundation-css-d...,easy
2,Top 20 Excel Shortcuts That You Need To Know,priyankab14,"17 Feb, 2022",https://www.geeksforgeeks.org/top-20-excel-sho...,easy
3,Servlet – Fetching Result,nishatiwari1719,"17 Feb, 2022",https://www.geeksforgeeks.org/servlet-fetching...,easy
4,Suffix Sum Array,rohit768,"21 Feb, 2022",https://www.geeksforgeeks.org/suffix-sum-array/,easy


In [4]:
#Checking dtypes
df.dtypes

title           object
author_id       object
last_updated    object
link            object
category        object
dtype: object

The last_updated column holds dates stored as strings, so let's convert it to a datetime type. First, let's check for missing values.

In [5]:
df.isna().sum()

title            0
author_id       19
last_updated    18
link             0
category         0
dtype: int64

There are some null values in the author_id and last_updated columns. Let's drop those records, since there are only a few of them.

In [6]:
df=df.dropna()

In [7]:
df.last_updated.sample(10)

27830    28 Jun, 2021
5258     09 Jun, 2020
781      09 Feb, 2022
33917    19 Feb, 2020
9581     05 Nov, 2020
11052    30 Sep, 2020
19482    17 Jun, 2021
5395     31 May, 2021
9185     03 Jul, 2020
18322    13 Sep, 2021
Name: last_updated, dtype: object

In [8]:
df['category'].unique()

array(['easy', 'basic', 'medium', 'hard', 'expert'], dtype=object)

In [9]:
pd.to_datetime(df['last_updated'],format='%d %b, %Y')

ValueError: time data 'Medium' does not match format '%d %b, %Y' (match)

This column contains some values that are not dates. Here we can see that 'Medium' is present, so let's check for other such values.

In [10]:
print('easy' in list(df.last_updated))
print('Easy' in list(df.last_updated))
print('Medium' in list(df.last_updated))
print('medium' in list(df.last_updated))
print('basic' in list(df.last_updated))
print('Basic' in list(df.last_updated))
print('Hard' in list(df.last_updated))
print('hard' in list(df.last_updated))
print('expert' in list(df.last_updated))
print('Expert' in list(df.last_updated))

False
True
True
False
False
True
True
False
False
False


Easy, Medium, Basic and Hard appear in the column instead of dates. Let's remove those rows and convert the remaining values into dates.
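The ten membership checks above can also be collapsed into one step. The following is only a sketch on a made-up toy series (the values are illustrative, not taken from `articles.csv`): `errors="coerce"` makes `pd.to_datetime` return `NaT` for anything that does not match the format, so the rows that failed to parse can be listed directly.

```python
import pandas as pd

# Toy stand-in for df['last_updated']; these values are made up for illustration.
last_updated = pd.Series(
    ["21 Feb, 2022", "Medium", "09 Jun, 2020", "Easy", "Hard", "Medium"]
)

# errors="coerce" turns every value that does not match the format into NaT
parsed = pd.to_datetime(last_updated, format="%d %b, %Y", errors="coerce")

# Original strings of the rows that failed to parse, with their frequencies
bad_values = last_updated[parsed.isna()]
print(bad_values.value_counts())
```

This assumes missing values were already dropped, as done earlier, because a missing value would also show up as `NaT`.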

In [11]:
#A valid date string starts with a digit.
#Let's look at the rows where last_updated does not start with a digit
df[~df['last_updated'].str[:1].str.isnumeric()]

Unnamed: 0,title,author_id,last_updated,link,category
20,Must Do Coding Questions for Product Based Com...,GeeksforGeeks,Medium,https://www.geeksforgeeks.org/must-do-coding-q...,easy
152,Get Hired With GeeksforGeeks and Win Exciting ...,GeeksforGeeks,Easy,https://www.geeksforgeeks.org/get-hired-with-g...,easy
617,Recently Asked Interview Questions in Product ...,GeeksforGeeks,Medium,https://www.geeksforgeeks.org/recently-asked-i...,easy
654,100 Days of Code – A Complete Guide For Beginn...,anuupadhyay,Medium,https://www.geeksforgeeks.org/100-days-of-code...,easy
711,Be a Part of GeeksforGeeks YouTube World – Mor...,GeeksforGeeks,Medium,https://www.geeksforgeeks.org/be-a-part-of-gee...,easy
...,...,...,...,...,...
31290,Brent’s Cycle Detection Algorithm,Surya Priy,Hard,https://www.geeksforgeeks.org/brents-cycle-det...,hard
31348,Sparse Table,pawan_asipu,Hard,https://www.geeksforgeeks.org/sparse-table/,hard
31567,Number Theory (Interesting Facts and Algorithms),GeeksforGeeks,Hard,https://www.geeksforgeeks.org/number-theory-in...,hard
31570,Magic Square | Even Order,GeeksforGeeks,Hard,https://www.geeksforgeeks.org/magic-square-eve...,hard


There are 96 records with a non-date value in last_updated. Let's drop those rows and perform EDA on the remaining data.

In [12]:
#.copy() avoids a possible SettingWithCopyWarning when a column is reassigned later
df=df[df['last_updated'].str[0].str.isnumeric()].copy()

Change the data type of the 'last_updated' column from string to datetime

In [13]:
df['last_updated']=pd.to_datetime(df['last_updated'],format='%d %b, %Y')

In [14]:
df.head()

Unnamed: 0,title,author_id,last_updated,link,category
0,5 Best Practices For Writing SQL Joins,priyankab14,2022-02-21,https://www.geeksforgeeks.org/5-best-practices...,easy
1,Foundation CSS Dropdown Menu,ishankhandelwals,2022-02-20,https://www.geeksforgeeks.org/foundation-css-d...,easy
2,Top 20 Excel Shortcuts That You Need To Know,priyankab14,2022-02-17,https://www.geeksforgeeks.org/top-20-excel-sho...,easy
3,Servlet – Fetching Result,nishatiwari1719,2022-02-17,https://www.geeksforgeeks.org/servlet-fetching...,easy
4,Suffix Sum Array,rohit768,2022-02-21,https://www.geeksforgeeks.org/suffix-sum-array/,easy


In [15]:
df.dtypes

title                   object
author_id               object
last_updated    datetime64[ns]
link                    object
category                object
dtype: object

We successfully removed the dirty or incorrect data and converted the last_updated dtype from object to datetime.

# Data Analysis

## 1. Most popular author (in terms of number of articles contributed)

In [25]:
len(df.author_id.unique())

5583

There are 5583 authors who contributed to GfG. Let's find the top 10 who contributed the most.

In [22]:
df.author_id.value_counts()[:10]

GeeksforGeeks          11932
ManasChhabra2            317
Striver                  261
manjeet_04               246
Chinmoy Lenka            191
pawan_asipu              155
sarthak_ishu11           151
anuupadhyay              146
Shubrodeep Banerjee      143
ankita_saini             125
Name: author_id, dtype: int64

Above are the top 10 authors who contributed the most.
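Raw counts hide how dominant the top account is: from the numbers above, GeeksforGeeks holds 11932 of the 34455 cleaned rows, roughly a third. A minimal sketch of getting shares instead of counts, using a made-up toy series rather than the real data:

```python
import pandas as pd

# Toy stand-in for df['author_id']; the proportions are made up for illustration.
author_id = pd.Series(["GeeksforGeeks"] * 6 + ["Striver"] * 3 + ["manjeet_04"])

# normalize=True returns each author's fraction of all rows instead of a count
share = author_id.value_counts(normalize=True)
print((share * 100).round(1))
```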

In [35]:
t=df.author_id.value_counts()[:10]
px.bar(t,x=t.index,y=t.values)

## 2. Category column

In [53]:
df.category.value_counts()

medium    10409
easy       9602
basic      8127
hard       4223
expert     2094
Name: category, dtype: int64

In [65]:
px.histogram(df,x='category',range_y=[2000,11000])

## 3. Popular authors according to each category

In [67]:
df['category'].unique()

array(['easy', 'basic', 'medium', 'hard', 'expert'], dtype=object)

##### Above are the categories present. Let's filter the data on each category and find the popular authors in that category

In [119]:
df[df['category']=='basic'].author_id.value_counts()[:5]

GeeksforGeeks          2291
ManasChhabra2           293
Shubrodeep Banerjee     112
Chinmoy Lenka            98
manjeet_04               91
Name: author_id, dtype: int64

In [120]:
df[df['category']=='easy'].author_id.value_counts()[:5]

GeeksforGeeks     3814
sarthak_ishu11      77
manjeet_04          73
Striver             70
Chinmoy Lenka       55
Name: author_id, dtype: int64

In [121]:
df[df['category']=='medium'].author_id.value_counts()[:5]

GeeksforGeeks    3975
Striver            90
pawan_asipu        53
manjeet_04         49
anuupadhyay        48
Name: author_id, dtype: int64

In [122]:
df[df['category']=='hard'].author_id.value_counts()[:5]

GeeksforGeeks        1567
pawan_asipu            26
Striver                25
DivyanshuShekhar1      25
priyavermaa1198        21
Name: author_id, dtype: int64

In [123]:
df[df['category']=='expert'].author_id.value_counts()[:5]

GeeksforGeeks      285
mishrapriyank17     36
pintusaini          34
harkiran78          24
zack_aayush         22
Name: author_id, dtype: int64

## 4. Counting articles of popular authors according to each category

In [98]:
pop_authors=set(df.author_id.value_counts()[1:10].index)
pop_authors

{'Chinmoy Lenka',
 'ManasChhabra2',
 'Shubrodeep Banerjee',
 'Striver',
 'ankita_saini',
 'anuupadhyay',
 'manjeet_04',
 'pawan_asipu',
 'sarthak_ishu11'}

As GeeksforGeeks itself has a lot of contributions, I excluded it so that we can focus on other people.

Among the popular authors above, let's count the number of articles in each category.
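Skipping GeeksforGeeks with the slice `[1:10]` relies on it being in first position. A slightly more defensive sketch, assuming the account label is exactly `'GeeksforGeeks'`, drops it by name instead. The toy data below is made up for illustration.

```python
import pandas as pd

# Toy stand-in for df['author_id']; counts are made up for illustration.
author_id = pd.Series(
    ["GeeksforGeeks"] * 5 + ["Striver"] * 3 + ["manjeet_04"] * 2 + ["pawan_asipu"]
)

counts = author_id.value_counts()

# Drop the house account by label (errors="ignore" tolerates its absence),
# then keep the top contributors; 2 here, 9 in the notebook.
pop_authors = set(counts.drop("GeeksforGeeks", errors="ignore").head(2).index)
print(pop_authors)
```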

In [99]:
pop_authors_df=df[df['author_id'].isin(pop_authors)]
pop_authors_df

Unnamed: 0,title,author_id,last_updated,link,category
287,How many types of number systems are there?,ManasChhabra2,2021-09-21,https://www.geeksforgeeks.org/how-many-types-o...,easy
429,Python program to a Sort Matrix by index-value...,manjeet_04,2021-07-18,https://www.geeksforgeeks.org/python-program-t...,easy
431,Python – Find the difference of the sum of lis...,manjeet_04,2021-07-18,https://www.geeksforgeeks.org/python-find-the-...,easy
474,Documenting Flask Endpoint using Flask-Autodoc,manjeet_04,2021-07-04,https://www.geeksforgeeks.org/documenting-flas...,easy
494,Python – Get word frequency in percentage,manjeet_04,2021-06-30,https://www.geeksforgeeks.org/python-get-word-...,easy
...,...,...,...,...,...
34386,PHP | sqrt( ) Function,Shubrodeep Banerjee,2020-11-23,https://www.geeksforgeeks.org/php-sqrt-function/,expert
34389,PHP | Sessions,Shubrodeep Banerjee,2019-02-12,https://www.geeksforgeeks.org/php-sessions/,expert
34391,Python string | isdecimal(),Striver,2018-01-05,https://www.geeksforgeeks.org/python-string-is...,expert
34396,Python | Maximum sum of elements of list in a ...,Striver,2018-11-21,https://www.geeksforgeeks.org/python-maximum-s...,expert


In [124]:
pop_authors_df[['author_id','category','title']].groupby(['author_id','category']).count()

author_id,category,title
Chinmoy Lenka,basic,98
Chinmoy Lenka,easy,55
Chinmoy Lenka,expert,3
Chinmoy Lenka,hard,6
Chinmoy Lenka,medium,29
ManasChhabra2,basic,293
ManasChhabra2,easy,6
ManasChhabra2,expert,3
ManasChhabra2,hard,2
ManasChhabra2,medium,13


In [100]:
px.histogram(pop_authors_df,x='author_id',color='category')
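The grouped count above is in long form, one row per author and category. As an alternative sketch, `pd.crosstab` gives the same counts as an author-by-category grid, with zeros where an author has no articles in a category. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy stand-in for pop_authors_df; rows are made up for illustration.
pop_authors_df = pd.DataFrame({
    "author_id": ["Striver", "Striver", "manjeet_04", "Striver", "manjeet_04"],
    "category": ["easy", "medium", "easy", "medium", "basic"],
})

# Rows are authors, columns are categories, cells are article counts
table = pd.crosstab(pop_authors_df["author_id"], pop_authors_df["category"])
print(table)
```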

## 5. Analysis on date column

In [128]:
df['last_updated'].value_counts()

2021-06-28    1285
2021-11-24     155
2021-05-21     151
2021-06-30     151
2021-08-11     140
              ... 
2017-01-06       1
2016-06-07       1
2016-07-10       1
2017-05-04       1
2015-05-12       1
Name: last_updated, Length: 1958, dtype: int64

#### Let's focus on year

As this data doesn't cover the whole of 2022, let's ignore that year.

In [138]:
df[df['last_updated'].dt.year!=2022]['last_updated'].dt.year.value_counts().sort_index()

2010        1
2011        1
2012        5
2013       70
2014       53
2015      172
2016      200
2017     1021
2018     2522
2019     3985
2020     4625
2021    18613
Name: last_updated, dtype: int64

In [139]:
px.histogram(df[df['last_updated'].dt.year!=2022]['last_updated'].dt.year)

### A lot of articles were updated in 2021, which is fishy. It is about four times the 2020 count.
I think either some data from other years is missing or, if it is not, the large number of articles updated in 2021 is definitely an interesting point.
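One way to probe a yearly spike is to ask how much of each year's total comes from its single busiest date. In the real data, 1285 of the 18613 updates in 2021 fall on 2021-06-28, so one date alone does not explain the jump. The sketch below shows the computation on made-up toy dates; the column names `articles`, `total`, `busiest_day` and `busiest_day_share` are my own.

```python
import pandas as pd

# Toy stand-in for df['last_updated']; dates are made up for illustration.
df_toy = pd.DataFrame({"last_updated": pd.to_datetime([
    "2020-03-01", "2020-07-15",
    "2021-06-28", "2021-06-28", "2021-06-28", "2021-11-24",
])})

# Articles per calendar date
per_day = df_toy.groupby("last_updated").size().rename("articles").reset_index()
per_day["year"] = per_day["last_updated"].dt.year

# Per year: total articles and the count on the busiest single date
summary = per_day.groupby("year")["articles"].agg(total="sum", busiest_day="max")
summary["busiest_day_share"] = summary["busiest_day"] / summary["total"]
print(summary)
```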

#### Let's focus on month

In [140]:
df[df['last_updated'].dt.year!=2022]['last_updated'].dt.month.value_counts().sort_index()

1     1334
2     1401
3     1510
4     2318
5     3331
6     4060
7     2823
8     3192
9     2918
10    2775
11    2932
12    2674
Name: last_updated, dtype: int64

In [141]:
px.histogram(df[df['last_updated'].dt.year!=2022]['last_updated'].dt.month)

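##### More articles were updated in June than in any other month.
It may be because of summer holidays, or because there is more free time when classes have barely started at the beginning of the semester. Note, however, that 1285 of June's 4060 updates fall on the single date 2021-06-28, so that one day accounts for much of the peak.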
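Since months are pooled across all years here, a year-by-month grid can show whether June is high every year or only in one. This is a sketch on made-up toy dates, not the real data.

```python
import pandas as pd

# Toy stand-in for df['last_updated']; dates are made up for illustration.
dates = pd.Series(pd.to_datetime([
    "2020-06-01", "2020-12-05",
    "2021-06-28", "2021-06-28", "2021-06-30", "2021-01-10",
]))

# Rows are years, columns are months, cells are article counts
by_year_month = pd.crosstab(
    dates.dt.year, dates.dt.month, rownames=["year"], colnames=["month"]
)
print(by_year_month)
```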

#### Let's focus on day

In [142]:
df[df['last_updated'].dt.year!=2022]['last_updated'].dt.day.value_counts().sort_index()

1      915
2      888
3      922
4      832
5      921
6     1064
7     1076
8      946
9      986
10    1020
11    1027
12     862
13     897
14     873
15     865
16     912
17     991
18     887
19     981
20     841
21    1093
22    1189
23     987
24    1053
25     898
26    1146
27     991
28    2326
29    1107
30    1110
31     662
Name: last_updated, dtype: int64

In [151]:
px.histogram(df[df['last_updated'].dt.year!=2022]['last_updated'].dt.day,range_y=[600,2350])

##### A lot of articles were updated on the 28th of the month, roughly double the count of any other day. This is definitely fishy. Note that 1285 of these 2326 updates come from the single date 2021-06-28.
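To check whether the 28th is unusual in general or only because of one date, the day-of-month counts can be recomputed without 2021-06-28. In the real data that leaves 2326 - 1285 = 1041 updates on the 28th, in line with the other days. A sketch on made-up toy dates:

```python
import pandas as pd

# Toy stand-in for df['last_updated']; dates are made up for illustration.
dates = pd.Series(pd.to_datetime(
    ["2021-06-28"] * 5
    + ["2021-05-28", "2021-05-10", "2021-07-10", "2021-08-03"]
))

# Exclude the single spike date, then count by day of month
spike = pd.Timestamp("2021-06-28")
without_spike = dates[dates != spike]
day_counts = without_spike.dt.day.value_counts().sort_index()
print(day_counts)
```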

## 6. Analysis on title column

#### Let's split the titles into words and count them to find the most frequent words, i.e. the popular keywords

In [156]:
df['title'].value_counts()

GATE | GATE CS 2013 | Question 65                       19
GATE | GATE-CS-2015 (Set 1) | Question 65               18
Amazon Interview Experience for SDE-1                   18
GATE | GATE-CS-2014-(Set-1) | Question 65               17
GATE | GATE-CS-2014-(Set-3) | Question 65               14
                                                        ..
Python Program for Sum the digits of a given number      1
Spurious Tuples in DBMS                                  1
Normalization vs Standardization                         1
How to Download and Install Java for 64 bit machine?     1
Data Structures and Algorithms | Set 21                  1
Name: title, Length: 33981, dtype: int64

There are some titles which are repeated.
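To look at the repeated titles themselves rather than just their counts, `DataFrame.duplicated` with `keep=False` flags every row whose title occurs more than once. A sketch on a made-up toy frame (the author names `author_a` and `author_b` are hypothetical):

```python
import pandas as pd

# Toy stand-in for df; rows are made up for illustration.
df_toy = pd.DataFrame({
    "title": [
        "Sparse Table",
        "Amazon Interview Experience for SDE-1",
        "Amazon Interview Experience for SDE-1",
        "Suffix Sum Array",
    ],
    "author_id": ["pawan_asipu", "author_a", "author_b", "rohit768"],
})

# keep=False marks all occurrences of a repeated title, not just the later ones
dupes = df_toy[df_toy.duplicated("title", keep=False)].sort_values("title")
n_repeated_titles = dupes["title"].nunique()
print(dupes)
```

Whether such rows should be kept depends on the question: the same title by different authors, as with interview experiences, is probably a genuine separate article.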

In [180]:
#for each title in title column, split the title and append each word in the title to a list
allkeywords=[]
for title in df['title']:
    for word in title.split():
        allkeywords.append(word.lower())

In [181]:
#Find frequency of each word by using Counter
freq_word=Counter(allkeywords)
freq_word

Counter({'5': 295,
         'best': 93,
         'practices': 11,
         'for': 2736,
         'writing': 42,
         'sql': 248,
         'joins': 4,
         'foundation': 3,
         'css': 250,
         'dropdown': 10,
         'menu': 24,
         'top': 212,
         '20': 58,
         'excel': 83,
         'shortcuts': 7,
         'that': 995,
         'you': 85,
         'need': 26,
         'to': 6619,
         'know': 39,
         'servlet': 9,
         '–': 1695,
         'fetching': 4,
         'result': 17,
         'suffix': 43,
         'sum': 1559,
         'array': 2336,
         'kelvin': 2,
         'celsius': 4,
         'formula': 21,
         'how': 2286,
         'install': 101,
         'mongodb': 49,
         'vscode?': 1,
         '7': 229,
         'highest': 30,
         'paying': 8,
         'programming': 306,
         'languages': 43,
         'freelancers': 1,
         'in': 10173,
         '2022': 85,
         'free': 52,
         'resume': 10,
     

Words with length less than or equal to 2 are not useful here, as most of them are filler words such as 'to' and 'in'.

In [183]:
#Filter out words with length less than or equal to 2
freq_word={i:freq_word[i] for i in freq_word if len(i)>2}
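The length filter is a quick approximation: it keeps filler words like 'and' or 'the', would drop short but real keywords such as 'c#' or 'go', and leaves punctuation attached ('vscode?'). A possible refinement, sketched here with a small hand-written stop-word set that is purely illustrative, is to tokenize with a regular expression and filter by an explicit list.

```python
import re
from collections import Counter

# Toy stand-in for df['title']; a few sample titles for illustration.
titles = [
    "How to Install MongoDB in VSCode?",
    "Sum of an Array in Python",
    "Python – Get word frequency in percentage",
]

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"how", "to", "in", "of", "an", "the", "and", "for", "with", "from", "get"}

def tokenize(title):
    # Keep runs of letters, digits, '+' and '#': "vscode?" -> "vscode", "c++" survives
    return re.findall(r"[a-z0-9+#]+", title.lower())

freq_word = Counter(
    word for title in titles for word in tokenize(title) if word not in STOP_WORDS
)
print(freq_word.most_common(5))
```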

In [190]:
pd.Series(freq_word).sort_values(ascending=False)[:20]

and           4456
interview     4322
experience    3862
the           3513
set           3139
for           2736
python        2689
using         2514
with          2494
array         2336
how           2286
given         2162
number        2103
question      1830
java          1828
find          1636
sum           1559
program       1430
string        1404
from          1379
dtype: int64

### Among the most frequently occurring words, only a few make sense as keywords.
### The meaningful keywords in the top 20 are
##### 1. Interview
##### 2. Python
##### 3. Array
##### 4. Java
##### 5. String

## 7. Count of articles with above popular keywords

In [192]:
pd.Series(freq_word).sort_values(ascending=False)[['interview','python','array','java','string']]

interview    4322
python       2689
array        2336
java         1828
string       1404
dtype: int64
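Strictly speaking, the numbers above count word occurrences, so a title that uses a keyword twice is counted twice, and tokens with attached punctuation are missed. A sketch of counting articles instead, using a whole-word, case-insensitive match on made-up toy titles:

```python
import re
import pandas as pd

# Toy stand-in for df['title']; titles are made up for illustration.
titles = pd.Series([
    "Python string | isdecimal()",
    "Array of arrays: array basics",
    "JavaScript Array methods",
    "Java Interview Experience",
])

keywords = ["interview", "python", "array", "java", "string"]

# \b...\b matches the keyword only as a whole word, so "java" does not
# match "JavaScript"; each title counts at most once per keyword.
article_counts = pd.Series({
    kw: int(titles.str.contains(rf"\b{re.escape(kw)}\b", case=False, regex=True).sum())
    for kw in keywords
})
print(article_counts)
```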

In [196]:
t=pd.Series(freq_word)[['interview','python','array','java','string']]
#The counts are already aggregated, so a bar chart is the more direct choice
px.bar(t,x=t.index,y=t.values)

## 8. Articles of my favourite author Striver😍

In [199]:
df[df['author_id']=='Striver']

Unnamed: 0,title,author_id,last_updated,link,category
3188,Sum of the sums of all possible subsets,Striver,2022-02-15,https://www.geeksforgeeks.org/sum-of-the-sums-...,easy
3205,Maximum Bitwise AND value of subsequence of le...,Striver,2021-06-08,https://www.geeksforgeeks.org/maximum-bitwise-...,easy
3524,Given two arrays count all pairs whose sum is ...,Striver,2021-03-03,https://www.geeksforgeeks.org/given-two-arrays...,easy
3537,Find the value of N when F(N) = f(a)+f(b) wher...,Striver,2021-05-11,https://www.geeksforgeeks.org/find-the-value-o...,easy
3565,Number of subsequences with zero sum,Striver,2021-05-11,https://www.geeksforgeeks.org/number-of-subseq...,easy
...,...,...,...,...,...
34158,Check if the given Prufer sequence is valid or...,Striver,2021-05-07,https://www.geeksforgeeks.org/check-if-the-giv...,expert
34218,Largest subset of rectangles such that no rect...,Striver,2021-06-15,https://www.geeksforgeeks.org/largest-subset-o...,expert
34391,Python string | isdecimal(),Striver,2018-01-05,https://www.geeksforgeeks.org/python-string-is...,expert
34396,Python | Maximum sum of elements of list in a ...,Striver,2018-11-21,https://www.geeksforgeeks.org/python-maximum-s...,expert


#### There are a total of 261 articles by Striver

In [204]:
df[df['author_id']=='Striver'][['title','category']].groupby('category').count()

category,title
basic,71
easy,70
expert,5
hard,25
medium,90


#### Striver wrote a lot of medium level articles. He also wrote some hard and expert articles
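The grouped output above is sorted alphabetically, which puts 'expert' before 'hard'. Assuming the difficulty order is basic < easy < medium < hard < expert (my assumption, not stated in the data), an ordered categorical makes the counts come out in that order. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the Striver subset of df; rows are made up for illustration.
striver = pd.DataFrame({
    "title": list("abcdef"),
    "category": ["medium", "basic", "hard", "medium", "expert", "easy"],
})

# Assumed difficulty order, easiest first
LEVELS = ["basic", "easy", "medium", "hard", "expert"]
striver["category"] = pd.Categorical(
    striver["category"], categories=LEVELS, ordered=True
)

# Grouping by an ordered categorical sorts groups by category order
by_level = striver.groupby("category", observed=False)["title"].count()
print(by_level)
```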

In [208]:
df[(df['author_id']=='Striver') & ((df['category']=='hard') | (df['category']=='expert'))]

Unnamed: 0,title,author_id,last_updated,link,category
30437,Arrange N elements in circular fashion such th...,Striver,2022-02-02,https://www.geeksforgeeks.org/arrange-n-elemen...,hard
30522,Print the degree of every node from the given ...,Striver,2021-11-05,https://www.geeksforgeeks.org/print-the-degree...,hard
30594,Count pairs of non-overlapping palindromic sub...,Striver,2021-05-21,https://www.geeksforgeeks.org/count-pairs-of-n...,hard
30617,Maximum sum of nodes in Binary tree such that ...,Striver,2020-12-04,https://www.geeksforgeeks.org/maximum-sum-of-n...,hard
30636,Print the longest prefix of the given string w...,Striver,2021-05-28,https://www.geeksforgeeks.org/print-the-longes...,hard
30642,Check if matrix can be converted to another ma...,Striver,2021-05-28,https://www.geeksforgeeks.org/check-if-matrix-...,hard
30716,Count number of sub-sequences with GCD 1,Striver,2021-11-25,https://www.geeksforgeeks.org/count-number-of-...,hard
30727,Count pairs of parentheses sequences such that...,Striver,2021-05-31,https://www.geeksforgeeks.org/count-pairs-of-p...,hard
30745,Predict the winner of the game | Sprague-Grundy,Striver,2018-12-25,https://www.geeksforgeeks.org/predict-the-winn...,hard
30747,Ways to fill N positions using M colors such t...,Striver,2021-05-26,https://www.geeksforgeeks.org/ways-to-fill-n-p...,hard


##### Above are the hard and expert level articles by Striver.

# Some insights

#### 1. Popular authors are
GeeksforGeeks, ManasChhabra2, Striver, manjeet_04, Chinmoy Lenka, pawan_asipu, sarthak_ishu11, anuupadhyay, Shubrodeep Banerjee, ankita_saini
#### 2. A lot of articles were updated in 2021, which is fishy
#### 3. A lot of articles were updated in June
#### 4. A lot of articles were updated on the 28th of the month, which is fishy
#### 5. Popular keywords are Interview, Python, Array, Java, String
#### Counts of those popular keywords across article titles are
##### interview - 4322
##### python - 2689
##### array - 2336
##### java - 1828
##### string - 1404