<a href="https://colab.research.google.com/github/imtufail/Machine-Learning-Contents/blob/main/Understanding-Data/Basic_Questions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Distribution in a dataset
Looking at the distribution of each column helps you understand how its values are spread. This is a key step before preprocessing or modeling.

---

### ✅ Useful information you can extract from distributions:

#### 1. **Value concentration (skewness)**

* Do most values fall in a small range?
* Is the data **left-skewed**, **right-skewed**, or **symmetric**?

Knowing the skew helps you choose a normalization or transformation method.
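As a quick numeric check for skew, here is a minimal sketch on made-up numbers (the `values` series is hypothetical, not taken from the datasets used later). `Series.skew()` is positive for a right tail and negative for a left tail, and a mean well above the median points the same way.

```python
import pandas as pd

# Toy data (hypothetical): mostly small values plus one large value on the right
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 50], name="toy")

skewness = values.skew()  # sample skewness; > 0 suggests a right tail
mean, median = values.mean(), values.median()

print(f"skew={skewness:.2f}, mean={mean:.2f}, median={median:.2f}")
```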

---

#### 2. **Outliers**

* Do some values fall far outside the typical range?

Outliers can distort models and may need to be removed or capped.
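One common rule of thumb (an assumption here, not the only option) flags values that lie more than 1.5 × IQR beyond the quartiles. The sketch below applies it to a made-up series and shows capping with `clip()` as an alternative to dropping rows.

```python
import pandas as pd

# Toy data (hypothetical): one value far outside the typical range
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Rows outside the fences are flagged as outliers
outliers = values[(values < lower) | (values > upper)]

# Capping keeps every row but pulls extreme values back to the fences
capped = values.clip(lower=lower, upper=upper)

print(outliers.tolist(), capped.max())
```

Whether to remove, cap, or keep such values depends on whether they are errors or genuine extreme observations.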

---

#### 3. **Class imbalance (for categorical variables)**

* Are some classes much more frequent than others?

Useful in classification problems. Example: 95% class A, 5% class B → model may always predict A.
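The 95% / 5% example can be reproduced with `value_counts(normalize=True)`. This is a small sketch on a made-up `labels` series. The share of the largest class is also the accuracy of a model that always predicts that class, which is a useful baseline to compare against.

```python
import pandas as pd

# Toy labels (hypothetical): 95 rows of class A, 5 rows of class B
labels = pd.Series(["A"] * 95 + ["B"] * 5, name="label")

shares = labels.value_counts(normalize=True)  # fraction of rows per class
majority_baseline = shares.max()  # accuracy of always predicting the top class

print(shares)
print(f"always-predict-majority accuracy: {majority_baseline:.2f}")
```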

---

#### 4. **Spread (variance, standard deviation)**

* Is the column tightly packed or widely spread?

Important for algorithms sensitive to scale (e.g., k-NN, SVM).
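To see why spread matters for distance-based models, compare the standard deviations of two columns and then standardize them. The sketch below uses a tiny made-up frame and manual z-scoring. A library scaler would do the same job, and this is only one possible approach.

```python
import pandas as pd

# Toy frame (hypothetical values): the two columns live on different scales
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

print(df.std())  # fare varies much more than age

# z-score standardization: subtract the mean, divide by the standard deviation
standardized = (df - df.mean()) / df.std()
print(standardized.std())  # each column now has a standard deviation of 1
```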

---

#### 5. **Data types and encoding needs**

* Are categorical values evenly distributed or dominated by a few?

This helps decide whether to use one-hot, label, or frequency encoding.
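As an illustration, here is a minimal sketch of two of those encodings on a made-up `embarked` column. Frequency encoding replaces each category with its share of rows, while one-hot encoding creates one indicator column per category.

```python
import pandas as pd

# Toy categorical column (hypothetical): one dominant category
embarked = pd.Series(["S", "S", "S", "C", "Q", "S"], name="embarked")

# Frequency encoding: map each category to its relative frequency
freq = embarked.value_counts(normalize=True)
freq_encoded = embarked.map(freq)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(embarked, prefix="embarked")

print(freq_encoded.tolist())
print(one_hot.columns.tolist())
```

With many rare categories, one-hot encoding produces many sparse columns, which is one reason to look at the category distribution first.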

---

#### 6. **Missing values or zero dominance**

* Does a feature have too many zeroes or NaNs?

High zero-rate columns may need special treatment or removal.
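Both rates can be computed per column in one line each. The frame below is made up for illustration, and what counts as "too many" is a judgment call that depends on the problem.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): a mostly-zero numeric column and a mostly-missing text column
df = pd.DataFrame({
    "parch": [0, 0, 0, 1, 0, 2],
    "cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan],
})

nan_rate = df.isna().mean()  # fraction of missing values per column
zero_rate = df.select_dtypes(include="number").eq(0).mean()  # fraction of zeros, numeric columns only

print(nan_rate)
print(zero_rate)
```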

---

### 🔍 Why distributions are examined per column:

Each column carries different meaning and scale:

* Age: right-skewed (more young people)
* Income: highly skewed (few rich, many average)
* Gender: balanced or imbalanced

Understanding **per-column distribution** tells you:

* What transformations to apply (log, binning, etc.)
* Whether scaling is needed
* Which features are useful, redundant, or harmful
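To make the log-transform point concrete, the sketch below uses a made-up income-like series in which each value doubles the previous one. The raw values are strongly right-skewed, and after `np.log1p` they are roughly evenly spaced, so the skew nearly disappears.

```python
import numpy as np
import pandas as pd

# Toy income-like data (hypothetical): each value doubles the previous one
income = pd.Series([10_000 * 2**k for k in range(7)], dtype="float64")

log_income = np.log1p(income)  # log(1 + x), safe when a column contains zeros

print(f"raw skew: {income.skew():.2f}")
print(f"log skew: {log_income.skew():.2f}")
```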

---

### 🔧 Tools to view distributions:

* `df['column'].hist()` or `sns.histplot(df['column'])` for histograms
* `sns.boxplot(x=df['column'])` for box plots
* `df['column'].value_counts()` for categorical columns
* `df.describe()` for numeric summaries

---




# Basic Questions

## 1. How big is the data?

In [3]:
import pandas as pd
df = pd.read_csv('/content/train.xls',low_memory=False)

In [7]:
df2 = pd.read_csv('/content/Quran-Dataset.xls',low_memory=False)

In [10]:
df.shape

(891, 12)

In [11]:
df2.shape

(114, 7)
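`shape` gives the number of rows and columns. How much memory the frame occupies is a second way to answer "how big". A minimal sketch on a small stand-in frame (the real CSV files are not loaded here, and the values are just for illustration):

```python
import pandas as pd

# Small stand-in frame (hypothetical); in the notebook this would be df or df2
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Fare": [7.25, 71.2833, 7.925],
})

n_rows, n_cols = df.shape
bytes_per_column = df.memory_usage(deep=True)  # deep=True also counts string contents
total_bytes = bytes_per_column.sum()

print(n_rows, n_cols, total_bytes)
```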

## 2. What does the data look like?

In [12]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [13]:
df2.head()

Unnamed: 0.1,Unnamed: 0,chapter,name,englishname,arabicname,revelation,verses
0,0,1,Al-Faatiha,The Opening,سُوْرَةُ الْفَاتِحَةِ,Mecca,"[{'verse': 1, 'line': 1, 'juz': 1, 'manzil': 1..."
1,1,2,Al-Baqara,The Cow,سُوْرَةُ البَقَرَةِ,Madina,"[{'verse': 1, 'line': 8, 'juz': 1, 'manzil': 1..."
2,2,3,Aal-i-Imraan,The Family of Imraan,سُوْرَةُ اٰلِ عِمْرٰنَ,Madina,"[{'verse': 1, 'line': 294, 'juz': 3, 'manzil':..."
3,3,4,An-Nisaa,The Women,سُوْرَةُ النِّسَآءِ,Madina,"[{'verse': 1, 'line': 494, 'juz': 4, 'manzil':..."
4,4,5,Al-Maaida,The Table,سُوْرَةُ المَآئِدَةِ,Madina,"[{'verse': 1, 'line': 670, 'juz': 6, 'manzil':..."


In [14]:
df.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
385,386,0,2,"Davies, Mr. Charles Henry",male,18.0,0,0,S.O.C. 14879,73.5,,S
364,365,0,3,"O'Brien, Mr. Thomas",male,,1,0,370365,15.5,,Q
657,658,0,3,"Bourke, Mrs. John (Catherine)",female,32.0,1,1,364849,15.5,,Q
693,694,0,3,"Saad, Mr. Khalil",male,25.0,0,0,2672,7.225,,C
141,142,1,3,"Nysten, Miss. Anna Sofia",female,22.0,0,0,347081,7.75,,S


In [15]:
df2.sample(10)

Unnamed: 0.1,Unnamed: 0,chapter,name,englishname,arabicname,revelation,verses
8,8,9,At-Tawba,The Repentance,سُوْرَةُ التَّوْبَةِ,Madina,"[{'verse': 1, 'line': 1236, 'juz': 10, 'manzil..."
108,108,109,Al-Kaafiroon,The Disbelievers,سُوْرَةُ الْكَافِرُوْنَ,Mecca,"[{'verse': 1, 'line': 6208, 'juz': 30, 'manzil..."
82,82,83,Al-Mutaffifin,Defrauding,سُوْرَةُ المُطَفِّفِيْنَ,Mecca,"[{'verse': 1, 'line': 5849, 'juz': 30, 'manzil..."
72,72,73,Al-Muzzammil,The Enshrouded One,سُوْرَةُ الْمُزَّمِّلِ,Mecca,"[{'verse': 1, 'line': 5476, 'juz': 29, 'manzil..."
1,1,2,Al-Baqara,The Cow,سُوْرَةُ البَقَرَةِ,Madina,"[{'verse': 1, 'line': 8, 'juz': 1, 'manzil': 1..."
36,36,37,As-Saaffaat,Those drawn up in Ranks,سُوْرَةُ الصَّافَّاتِ,Mecca,"[{'verse': 1, 'line': 3789, 'juz': 23, 'manzil..."
2,2,3,Aal-i-Imraan,The Family of Imraan,سُوْرَةُ اٰلِ عِمْرٰنَ,Madina,"[{'verse': 1, 'line': 294, 'juz': 3, 'manzil':..."
23,23,24,An-Noor,The Light,سُوْرَةُ النُّوْرِ,Madina,"[{'verse': 1, 'line': 2792, 'juz': 18, 'manzil..."
58,58,59,Al-Hashr,The Exile,سُوْرَةُ الْحَشْرِ,Madina,"[{'verse': 1, 'line': 5127, 'juz': 28, 'manzil..."
53,53,54,Al-Qamar,The Moon,سُوْرَةُ الْقَمَرِ,Mecca,"[{'verse': 1, 'line': 4847, 'juz': 27, 'manzil..."


In [16]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [17]:
df2.tail()

Unnamed: 0.1,Unnamed: 0,chapter,name,englishname,arabicname,revelation,verses
109,109,110,An-Nasr,Divine Support,سُوْرَةُ النَّصْرِ,Madina,"[{'verse': 1, 'line': 6214, 'juz': 30, 'manzil..."
110,110,111,Al-Masad,The Palm Fibre,سُوْرَةُ المَسَدِ,Mecca,"[{'verse': 1, 'line': 6217, 'juz': 30, 'manzil..."
111,111,112,Al-Ikhlaas,Sincerity,سُوْرَةُ الْاِخْلَاصِ,Mecca,"[{'verse': 1, 'line': 6222, 'juz': 30, 'manzil..."
112,112,113,Al-Falaq,The Dawn,سُوْرَةُ الْفَلَقِ,Mecca,"[{'verse': 1, 'line': 6226, 'juz': 30, 'manzil..."
113,113,114,An-Naas,Mankind,سُوْرَةُ النَّاسِ,Mecca,"[{'verse': 1, 'line': 6231, 'juz': 30, 'manzil..."


## 3. What are the data types of the columns?

Checking the data types matters because you may later need to change some of them to speed up the analysis and reduce memory usage.
For example, a date/time column stored as `object` can be converted to a datetime type, and a numeric column stored as `int64` or `float64` can be downcast to a smaller type. Note that Age cannot simply become an integer here: it contains missing values and fractional ages such as 0.42.
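A minimal sketch of such conversions on a small stand-in frame. The `Boarded` date column is hypothetical (the Titanic file has no date column) and is only there to show `pd.to_datetime`. The dtype choices are assumptions that suit these toy values.

```python
import pandas as pd

# Stand-in frame (hypothetical values; "Boarded" is an invented column)
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
    "Age": [22.0, 38.0, None, 35.0],
    "Boarded": ["1912-04-10"] * 4,
})

df["Pclass"] = df["Pclass"].astype("int8")      # small integer range fits in 8 bits
df["Sex"] = df["Sex"].astype("category")        # few distinct values, so store codes
df["Age"] = df["Age"].astype("float32")         # stays float because of NaN and fractional ages
df["Boarded"] = pd.to_datetime(df["Boarded"])   # text becomes a real datetime dtype

print(df.dtypes)
```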

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The output above also shows how many non-null values each column has. For example, Age has 714 non-null values out of 891, so 177 are missing.

In [19]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114 entries, 0 to 113
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Unnamed: 0   114 non-null    int64 
 1   chapter      114 non-null    int64 
 2   name         114 non-null    object
 3   englishname  114 non-null    object
 4   arabicname   114 non-null    object
 5   revelation   114 non-null    object
 6   verses       114 non-null    object
dtypes: int64(2), object(5)
memory usage: 6.4+ KB


## 4. Are there any missing values?

In [21]:
df.isnull().sum()

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
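These counts can be cross-checked against the `df.info()` output from the previous section, because missing = total rows − non-null count. The sketch below redoes that arithmetic with the non-null counts reported there for Age, Cabin, and Embarked. The numbers are typed in by hand rather than read from the file.

```python
import pandas as pd

n_rows = 891  # from RangeIndex: 891 entries

# Non-null counts copied from the df.info() output
non_null = pd.Series({"Age": 714, "Cabin": 204, "Embarked": 889})

missing = n_rows - non_null
percent = (missing / n_rows * 100).round(1)

print(pd.DataFrame({"missing": missing, "percent": percent}))
```

On the loaded frame itself, `df.isnull().mean() * 100` gives the same percentages directly.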


In [23]:
df2.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
chapter,0
name,0
englishname,0
arabicname,0
revelation,0
verses,0


## 5. What does the data look like mathematically?

In [24]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


`describe()` summarizes each numeric column. For Age, the mean is about 29.7 years, 25% of passengers were about 20 or younger (20.125), 50% were 28 or younger (the median), and 75% were 38 or younger. The count of 714 reflects the missing ages.
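The percentile rows can be customized, and comparing the mean with the 50% row (the median) gives a quick hint about skew. A small sketch on a made-up `ages` series. The chosen percentiles are arbitrary.

```python
import pandas as pd

# Toy ages (hypothetical values, not the full Age column)
ages = pd.Series([0.42, 20.0, 28.0, 38.0, 80.0], name="Age")

# Ask for the 10th, 50th and 90th percentiles instead of the default quartiles
summary = ages.describe(percentiles=[0.1, 0.5, 0.9])

print(summary)
print("mean above median:", summary["mean"] > summary["50%"])
```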

In [26]:
df2.describe()

Unnamed: 0.1,Unnamed: 0,chapter
count,114.0,114.0
mean,56.5,57.5
std,33.052988,33.052988
min,0.0,1.0
25%,28.25,29.25
50%,56.5,57.5
75%,84.75,85.75
max,113.0,114.0


## 6. Are there duplicated rows?

In [27]:
df.duplicated().sum()

np.int64(0)

In [28]:
df2.duplicated().sum()

np.int64(0)
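Both datasets have no duplicated rows, so here is a sketch on a made-up frame that does contain one. It shows how to count duplicates, inspect them, and drop them. The column names are borrowed from the Titanic data, but the values are invented.

```python
import pandas as pd

# Toy frame (hypothetical): row 2 is an exact copy of row 0
df = pd.DataFrame({
    "Name": ["A", "B", "A", "C"],
    "Ticket": ["1", "2", "1", "3"],
    "Fare": [7.25, 8.05, 7.25, 30.0],
})

n_dupes = int(df.duplicated().sum())      # later copies only (keep="first" is the default)
all_copies = df[df.duplicated(keep=False)]  # every row that has a duplicate
deduped = df.drop_duplicates()            # keeps the first occurrence of each row

# Duplicates can also be checked on a subset of columns
n_ticket_dupes = int(df.duplicated(subset=["Ticket"]).sum())

print(n_dupes, len(all_copies), len(deduped), n_ticket_dupes)
```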

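## 7. How are the columns correlated?
Correlation tells us how strongly two columns move together: when the values in one column increase or decrease, how do the values in the other tend to change? `corr()` computes the Pearson coefficient by default, which ranges from -1 to 1.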



In [32]:
df.corr()

ValueError: could not convert string to float: 'Braund, Mr. Owen Harris'

In [33]:
df_numeric = df.select_dtypes(include='number')

In [34]:
df_numeric.corr()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
PassengerId,1.0,-0.005007,-0.035144,0.036847,-0.057527,-0.001652,0.012658
Survived,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307
Pclass,-0.035144,-0.338481,1.0,-0.369226,0.083081,0.018443,-0.5495
Age,0.036847,-0.077221,-0.369226,1.0,-0.308247,-0.189119,0.096067
SibSp,-0.057527,-0.035322,0.083081,-0.308247,1.0,0.414838,0.159651
Parch,-0.001652,0.081629,0.018443,-0.189119,0.414838,1.0,0.216225
Fare,0.012658,0.257307,-0.5495,0.096067,0.159651,0.216225,1.0
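As an alternative to selecting the numeric columns first, `corr()` accepts `numeric_only=True`. This sketch assumes a pandas version that has that parameter (1.5 or newer) and uses a small stand-in frame with invented values.

```python
import pandas as pd

# Stand-in frame (hypothetical values) with one text column
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Pclass": [3, 1, 3, 1],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

corr_selected = df.select_dtypes(include="number").corr()  # approach used above
corr_flag = df.corr(numeric_only=True)                     # skips non-numeric columns

print(corr_flag)
```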


In [36]:
df_numeric.corr()['Survived']

Unnamed: 0,Survived
PassengerId,-0.005007
Survived,1.0
Pclass,-0.338481
Age,-0.077221
SibSp,-0.035322
Parch,0.081629
Fare,0.257307


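Survived is only weakly to moderately correlated with the other numeric columns. The strongest relationships are with Pclass (-0.34: higher class numbers, i.e. lower classes, go with lower survival) and Fare (0.26: higher fares go with higher survival).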
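Sorting by absolute value makes the strongest linear relationships easier to spot, since a large negative coefficient matters as much as a large positive one. The sketch below re-enters the coefficients from the output above by hand instead of recomputing them.

```python
import pandas as pd

# Correlations with Survived, copied from the output above
corr_with_survived = pd.Series({
    "PassengerId": -0.005007,
    "Pclass": -0.338481,
    "Age": -0.077221,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
})

# Rank by strength, ignoring sign
ranked = corr_with_survived.abs().sort_values(ascending=False)
print(ranked)
```

On the loaded frame, the equivalent would be `df_numeric.corr()['Survived'].drop('Survived').abs().sort_values(ascending=False)`.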