In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('./Data/popular_git_repositories.csv')
df.head()

| | Name | Description | URL | Created At | Updated At | Homepage | Size | Stars | Forks | Issues | ... | Has Issues | Has Projects | Has Downloads | Has Wiki | Has Pages | Has Discussions | Is Fork | Is Archived | Is Template | Default Branch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | freeCodeCamp | freeCodeCamp.org's open-source codebase and cu... | https://github.com/freeCodeCamp/freeCodeCamp | 2014-12-24T17:49:19Z | 2023-09-21T11:32:33Z | http://contribute.freecodecamp.org/ | 387451 | 374074 | 33599 | 248 | ... | True | True | True | False | True | False | False | False | False | main |
| 1 | free-programming-books | :books: Freely available programming books | https://github.com/EbookFoundation/free-progra... | 2013-10-11T06:50:37Z | 2023-09-21T11:09:25Z | https://ebookfoundation.github.io/free-program... | 17087 | 298393 | 57194 | 46 | ... | True | False | True | False | True | False | False | False | False | main |
| 2 | awesome | 😎 Awesome lists about all kinds of interesting... | https://github.com/sindresorhus/awesome | 2014-07-11T13:42:37Z | 2023-09-21T11:18:22Z | | 1441 | 269997 | 26485 | 61 | ... | True | False | True | False | True | False | False | False | False | main |
| 3 | 996.ICU | Repo for counting stars and contributing. Pres... | https://github.com/996icu/996.ICU | 2019-03-26T07:31:14Z | 2023-09-21T08:09:01Z | https://996.icu | 187799 | 267901 | 21497 | 16712 | ... | False | False | True | False | False | False | False | True | False | master |
| 4 | coding-interview-university | A complete computer science study plan to beco... | https://github.com/jwasham/coding-interview-un... | 2016-06-06T02:34:12Z | 2023-09-21T10:54:48Z | | 20998 | 265161 | 69434 | 56 | ... | True | False | True | False | False | False | False | False | False | main |


## Are there any duplicated rows?

In [3]:
# DataFrame.duplicated() flags rows that repeat an earlier row in every column
print("Number of duplicated rows:", df.duplicated().sum())

Number of duplicated rows: 0
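
`df.duplicated()` compares entire rows, so a count of 0 only rules out rows that match in every column. As a minimal sketch on a hypothetical toy frame (the data and the choice of `URL` as a key column are assumptions for illustration, not part of the dataset analysis), the `subset` parameter checks for repeats in selected columns instead:

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Name": ["repo-a", "repo-b", "repo-a"],
    "URL": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/a",
    ],
    "Stars": [10, 20, 11],
})

# Whole-row check: no row repeats exactly, because Stars differs.
full_row_dupes = toy.duplicated().sum()

# Key-column check: the same URL appears twice.
url_dupes = toy.duplicated(subset=["URL"]).sum()

print("Fully duplicated rows:", full_row_dupes)
print("Rows with a repeated URL:", url_dupes)
```

Which check is appropriate depends on what counts as "the same repository" for the analysis at hand.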


## For each numerical column, how are the values distributed?
- What is the percentage of missing values?
- What are the min and max? Are they abnormal?

### Percentage of missing values

In [4]:
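# Count missing values in each numeric column
missing_vals = df.select_dtypes(include='number').isna().sum()
# Note: this is a fraction between 0 and 1; multiply by 100 for a percentage
missing_percentage = missing_vals / len(df)
missing_percentage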

Size        0.0
Stars       0.0
Forks       0.0
Issues      0.0
Watchers    0.0
dtype: float64
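
The values above are fractions between 0 and 1 rather than percentages. A minimal sketch, using a hypothetical toy frame, of an equivalent and slightly shorter calculation: `isna().mean()` gives the missing fraction directly, and multiplying by 100 turns it into a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Size": [1.0, np.nan, 3.0, 4.0],
    "Stars": [5, 6, 7, 8],
})

numeric = toy.select_dtypes(include="number")

# The mean of a boolean mask is the fraction of True values,
# which equals isna().sum() / len(toy).
missing_fraction = numeric.isna().mean()

# Scale to a percentage in the 0 to 100 range.
missing_pct = missing_fraction * 100
print(missing_pct)
```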

### Min? Max? Are they abnormal?

In [5]:
df.describe()

| | Size | Stars | Forks | Issues | Watchers |
|---|---|---|---|---|---|
| count | 215029.0 | 215029.0 | 215029.0 | 215029.0 | 215029.0 |
| mean | 54282.7 | 1115.085142 | 234.207637 | 37.925178 | 1115.085142 |
| std | 702397.8 | 3992.37205 | 1242.967451 | 196.50861 | 3992.37205 |
| min | 0.0 | 167.0 | 0.0 | 0.0 | 167.0 |
| 25% | 378.0 | 237.0 | 39.0 | 3.0 | 237.0 |
| 50% | 2389.0 | 377.0 | 79.0 | 10.0 | 377.0 |
| 75% | 15282.0 | 797.0 | 174.0 | 28.0 | 797.0 |
| max | 105078600.0 | 374074.0 | 243339.0 | 26543.0 | 374074.0 |


For the 4 features **Stars, Forks, Issues and Watchers**, the *max* value is very large compared to the rest of the column: hundreds of times larger than the 75% quantile (for example, 374,074 vs. 797 for Stars).

This is **understandable**, because a few repositories are *hugely more popular* than most others.
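
The comparison between the max and the 75% quantile can be computed rather than read off the table. A minimal sketch on hypothetical toy data (the numbers are made up for illustration), dividing the `max` row of `describe()` by the `75%` row:

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Stars": [200, 250, 300, 400, 100000],
    "Forks": [10, 20, 30, 40, 50],
})

summary = toy.describe()

# Ratio of the maximum to the 75% quantile, per column.
# A large ratio points to a long right tail.
ratio = summary.loc["max"] / summary.loc["75%"]
print(ratio)
```

In this toy example `Stars` has one extreme value and `Forks` does not, so only the `Stars` ratio is large.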

## For each categorical column, how are the values distributed?
- What is the percentage of missing values?
- How many different values are there? Show a few
- Are they abnormal?

### What is the percentage of missing values?

In [6]:
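# Count missing values in each non-numeric column (strings and booleans)
missing_vals = df.select_dtypes(exclude='number').isna().sum()
# Note: this is a fraction between 0 and 1; multiply by 100 for a percentage
missing_percentage = missing_vals / len(df)
missing_percentage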

Name               0.000009
Description        0.037353
URL                0.000000
Created At         0.000000
Updated At         0.000000
Homepage           0.635445
Language           0.074762
License            0.246660
Topics             0.000000
Has Issues         0.000000
Has Projects       0.000000
Has Downloads      0.000000
Has Wiki           0.000000
Has Pages          0.000000
Has Discussions    0.000000
Is Fork            0.000000
Is Archived        0.000000
Is Template        0.000000
Default Branch     0.000000
dtype: float64

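Features with missing values are **Name, Description, Homepage, Language and License**, although **Name** is missing in only about 0.0009% of rows.

Notably, **Homepage** has a missing fraction of **0.635445**, meaning about 63.5% of repositories (well over half of the data) have no homepage.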

### How many different values? Show a few

In [7]:
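categorical_vals = df.select_dtypes(exclude='number')

# Array of distinct values per column (NaN counts as a value here)
unique_vals = categorical_vals.apply(pd.Series.unique, axis=0)
# Number of distinct values per column
unique_counts = unique_vals.apply(len)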

Number of unique values for each column

In [8]:
print(unique_counts)

Name               196821
Description        206110
URL                215029
Created At         214922
Updated At         193011
Homepage            74198
Language              370
License                46
Topics             110123
Has Issues              2
Has Projects            2
Has Downloads           2
Has Wiki                2
Has Pages               2
Has Discussions         2
Is Fork                 1
Is Archived             2
Is Template             2
Default Branch       2326
dtype: int64
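
Because `pd.Series.unique` keeps `NaN` as one of the values, the counts above include missing values as a category. A minimal sketch on hypothetical toy data showing how `DataFrame.nunique` reproduces those counts with `dropna=False`, and what changes with the default `dropna=True`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Language": ["Python", np.nan, "Python", "Go"],
    "Is Fork": [False, False, False, False],
})

# Counts NaN as a distinct value, matching len(series.unique()).
counts_with_nan = toy.nunique(dropna=False)

# Default behaviour: NaN is ignored.
counts_without_nan = toy.nunique()

print(counts_with_nan)
print(counts_without_nan)
```

For columns with missing data the two counts differ by one, which matters when checking whether a column is effectively constant.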


Some values for each column

In [9]:
print(unique_vals)

Name               [freeCodeCamp, free-programming-books, awesome...
Description        [freeCodeCamp.org's open-source codebase and c...
URL                [https://github.com/freeCodeCamp/freeCodeCamp,...
Created At         [2014-12-24T17:49:19Z, 2013-10-11T06:50:37Z, 2...
Updated At         [2023-09-21T11:32:33Z, 2023-09-21T11:09:25Z, 2...
Homepage           [http://contribute.freecodecamp.org/, https://...
Language           [TypeScript, nan, Python, JavaScript, C++, She...
License            [BSD-3-Clause, CC-BY-4.0, CC0-1.0, NOASSERTION...
Topics             [['careers', 'certification', 'community', 'cu...
Has Issues                                             [True, False]
Has Projects                                           [True, False]
Has Downloads                                          [True, False]
Has Wiki                                               [False, True]
Has Pages                                              [True, False]
Has Discussions                   

### Are they abnormal?

- **Is Fork** has only 1 value: False. A constant column carries no information, so we can remove this feature.

In [10]:
df = df.drop('Is Fork', axis=1)
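
The same check can be generalized so that constant columns are found programmatically instead of by inspection. A minimal sketch on hypothetical toy data (the variable names `constant_cols` and `cleaned` are illustrative, not from the notebook):

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Is Fork": [False, False, False],
    "Is Archived": [False, True, False],
    "Stars": [10, 20, 30],
})

# A column is constant if it has exactly one distinct value
# (NaN is counted as a value, so an all-NaN column also qualifies).
constant_cols = [
    col for col in toy.columns if toy[col].nunique(dropna=False) == 1
]

# Drop every constant column in one call.
cleaned = toy.drop(columns=constant_cols)

print("Dropped:", constant_cols)
print("Remaining:", list(cleaned.columns))
```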