Data parsing involves:

- Data Reading: Accessing and reading the file from the source.
- Data Loading: Bringing the data into memory, often into a DataFrame.
- Data Splitting: Dividing the data into rows and columns based on delimiters.
- Type Conversion: Converting data into appropriate types (e.g., integers, floats) for proper analysis and manipulation.

These steps ensure that raw data is correctly formatted and structured for use in analysis.
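
The steps above can be sketched by hand in plain Python before handing the job to pandas. This is only an illustration: the sample text and the field names are made up to resemble a pipe-delimited file, and an in-memory buffer stands in for a real file.

```python
import io

# Hypothetical pipe-delimited sample (not taken from the real dataset).
raw_text = "user_id|age|gender\n1|24|M\n2|53|F\n"

# 1. Reading: open the source (an in-memory buffer stands in for a file).
with io.StringIO(raw_text) as source:
    lines = source.read().splitlines()

# 2. Splitting: cut every line into fields at the delimiter.
header, *rows = [line.split("|") for line in lines]

# 3. Type conversion: numeric fields arrive as strings and must be cast.
records = [
    {"user_id": int(uid), "age": int(age), "gender": gender}
    for uid, age, gender in rows
]

print(header)
print(records)
```

pandas performs the same reading, splitting, and conversion in a single `read_csv` call, which is why the notebook relies on it.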

**Need for Data Splitting**
- Separating rows and columns:
- In a raw data file, a delimiter (e.g., comma, pipe) marks where one column ends and the next begins, while line breaks separate the rows. Splitting organizes the data into proper rows and columns.

**Structured Data Creation:**
- Data splitting produces a structured format (e.g., a DataFrame) in which rows and columns are clearly defined,
 - so the data can be analyzed and manipulated easily.
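
A small sketch of why the delimiter matters, using a made-up three-row sample read from memory instead of the real URL. It also compares the two ways of setting the index, `index_col` during loading and `set_index` afterwards.

```python
import io
import pandas as pd

# Hypothetical sample in the same pipe-delimited layout.
raw_text = "user_id|age|occupation\n1|24|technician\n2|53|other\n3|23|writer\n"

# Without sep='|' pandas looks for commas and finds one wide column.
unsplit = pd.read_csv(io.StringIO(raw_text))

# With the right delimiter each field becomes its own column.
split = pd.read_csv(io.StringIO(raw_text), sep="|")

# index_col sets the index while loading; set_index does it afterwards.
indexed_on_load = pd.read_csv(io.StringIO(raw_text), sep="|", index_col="user_id")
indexed_later = split.set_index("user_id")

print(unsplit.shape)  # (3, 1)
print(split.shape)    # (3, 3)
print(indexed_on_load.equals(indexed_later))
```

Both indexing routes should give the same DataFrame; `index_col` simply saves a step.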

 

### Part 1
- Import the necessary libraries
- Import the dataset 
- Assign it to a variable called users and use the 'user_id' as index


In [2]:
#Importing the library
import pandas as pd
#importing the dataset
URL="https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
#1. pd.read_csv() Function:
#This is a function provided by the pandas library, used to read a CSV (Comma-Separated Values) file into a DataFrame. 
#The CSV format is a common file format for storing tabular data, where each 
#line represents a row of data and each value within a line is separated by a comma (or another delimiter).

#sep='|':

#The sep parameter specifies the delimiter used in the file to separate values. By default, read_csv() assumes the delimiter is a comma (,).
#However, in this dataset, the values are separated by a pipe (|) character. 
#Therefore, we specify sep='|' to tell pandas that this is the delimiter in your file.

#index_col sets the column as the index immediately during file loading, while set_index is used to set the index after the file has been loaded.

users=pd.read_csv(URL,sep='|') # pass index_col='user_id' here to set the index while reading the CSV file
print(f"number of columns before setting a column as index: {len(users.columns)}\n")
users = users.set_index('user_id')
print(f"number of columns after setting a column as index: {len(users.columns)}\n")
print(users.to_string())




### Part 2
- See the first 25 entries
- See the last 10 entries


In [None]:

print(users.head(25))
print("\n")
print(users.tail(10))
print("\n")
#In a dataset, the number of observations typically refers to the number of rows in the dataset. 
# Each row represents a single observation or record.



### Part 3
- What is the number of observations in the dataset?
- What is the number of columns in the dataset?
- Print the name of all the columns.



In [None]:
import pandas as pd
URL="https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
users=pd.read_csv(URL,sep='|',index_col='user_id') 

#Using the shape attribute:
#The shape attribute of a DataFrame returns a tuple where the first element is the number of rows (observations) and the second element is the number of columns.


print(f"1: No of observations using shape attribute is: {users.shape[0]}")
print(f"\n2: No of observations using len() is: {len(users)}")
print(f"\n3: No of observations using count().max() is: {users.count().max()}")
# ---------------------------------------Read Markdown 3 for explanation--------------------------- 

print(f"\n4: No of columns in the dataset using shape attribute is: {users.shape[1] }")
print(f"\n5: No of columns in the dataset using len() is: {len(users.columns)}")
print(f"\n\n\n6: No of rows and columns using shape attribute only: {users.shape }")


#-------------------------------------------Printing the names of all columns------------------------------

print(f"\n7: printing the column names using users.columns         {users.columns}")
print(f"\n8: printing the column names using users.columns.tolist()      {users.columns.tolist()}")

"""users.columns: Returns an Index object with the column names.
.tolist(): Converts the Index object to a regular Python list if you want a list format.
 If we just want to print the column names directly, we can use df.columns without .tolist().
.tolist() is used to convert pandas objects (like Index, Series, or columns) into Python lists.
------------------Example is given in head/tail file--------------------------------------------
It helps in making data compatible with other parts of a Python program, simplifies manipulation, and improves readability and debugging."""
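
A short sketch of the difference between the `Index` object and the list returned by `.tolist()`, on a made-up two-column frame:

```python
import pandas as pd

# Hypothetical frame used only to illustrate the two return types.
frame = pd.DataFrame({"age": [24, 53], "gender": ["M", "F"]})

column_index = frame.columns          # pandas Index object
column_list = frame.columns.tolist()  # plain Python list

print(type(column_index).__name__)
print(column_list)

# A plain list supports list-only operations such as append.
column_list.append("occupation")
print(column_list)
```

Appending to the list leaves the DataFrame itself unchanged, because `.tolist()` returns a copy of the labels.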



### Part 4
- How is the dataset indexed?
- What is the data type of each column?
- Print only the occupation column


In [None]:
import pandas as pd
URL="https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
users=pd.read_csv(URL,sep='|',index_col='user_id') 







#-------How is the dataset indexed
print(f"\n 1: rows indexing without to_list(): {users.index}")
print(f"\n\n 2: rows indexing with to_list(): {users.index.to_list()}")
print(f"\n\n 3: Column indexing without to_list(): {users.columns}")
print(f"\n\n 4: Column indexing with to_list(): {users.columns.to_list()}") #Column Names are the labels for the DataFrame columns.
                         #Column Index is the pandas object that contains these labels and provides additional functionalities for indexing.
print(f"\n\n5: Data Type of each Column \n{users.dtypes}")
print(f"\n\n 6: Occupation Column using dot notation is \n{users.occupation}")
print(f"\n\n 7: Occupation Column using bracket notation is \n{users['occupation']}")
#Read the Markdown 2 to see the differences b/w using dot and bracket notation

### Part 5
- How many different occupations are in this dataset?
- What is the most frequent occupation?


In [11]:
import pandas as pd
URL="https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
users=pd.read_csv(URL,sep='|',index_col='user_id') 









#The nunique() method in pandas is used to count the number of unique values in a Series or DataFrame. 
# It's a convenient and efficient way to determine how many distinct items are present in a dataset.
#------------------------------------Read the markdown#4------------------------------------------------
print(f" 1: The no of different occupations in the dataset using nunique() are:     {users['occupation'].nunique()}")
print()
print()
print()
print(users['occupation'].nunique())
#print(f"\n\n 2: The different occupations in the dataset using value_counts() & counts() is:     {users.occupation.value_counts().count()}")
print(f"\n\n 3: The different occupations in the dataset using value_counts() only:\n{users['occupation'].value_counts()}")





#The idxmax() function is used to identify the index of the maximum value in a pandas Series
#print(f"\n\n The most frequent occupation using idxmax is: {users['occupation'].value_counts().idxmax()}\n \n")
#print(f"\n\n The most frequent occupation using value_counts() is: {users.occupation.value_counts().head(1).index[0]}")
#print(f"\n\n\n The most frequent occupation using value_counts() but without index[0] is: {users.occupation.value_counts().head(1)}")
#--------------------------------------Read explanation in Markdown 5-------------------------------------------------------





#To access the data of student only
#print(users.loc[users["occupation"]=='student'])


 1: The no of different occupations in the dataset using nunique() are:     21



21


 3: The different occupations in the dataset using value_counts() only:
occupation
student          196
other            105
educator          95
administrator     79
engineer          67
programmer        66
librarian         51
writer            45
executive         32
scientist         31
artist            28
technician        27
marketing         26
entertainment     18
healthcare        16
retired           14
lawyer            12
salesman          12
none               9
homemaker          7
doctor             7
Name: count, dtype: int64


### Part 6
- Summarize the DataFrame.
- Summarize all the columns
- Summarize only the occupation column


In [None]:
import pandas as pd
URL="https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
users=pd.read_csv(URL,sep='|',index_col='user_id') 







#----------------------------------Summarizing the dataset-----------------------------------
#----------------------------------Markdown# 7,8-----------------------------------------------
print(f"\n\n1: Summary of data frame using describe()\n{users.describe()}")
print(f"\n\n2: Summary of dataframe using info() is:\n")
print(users.info())
print("\n\n\n")
users.info() #directly printing to console thus no None value
#In Python, a function without a return statement implicitly returns None. The info() method in pandas prints
# a summary of the DataFrame to the console and does not return any value, so its return value is None.
# If you wrap info() in print(), as above, the summary is printed first and then the word "None" appears.
#info() gives a summary of both numerical and categorical columns. It shows: column names, data types (e.g., int64, object for text data), non-null counts for each column, and memory usage.
#describe() is best for detailed statistical summaries of numerical columns.

print(f"\n\n3: Summary of all columns\n\n{users.describe(include='all')}")

print(f"\n\n4: Summary of occupation column is:\n{users['occupation'].describe()}")

### Part 7
- What is the mean age of users?
- What is the age with least occurrence?



In [None]:
import pandas as pd
URL="https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
users=pd.read_csv(URL,sep='|',index_col='user_id') 





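mean_age=users['age'].mean()
rounded_mean_age=round(mean_age,2)
print("\n\n Mean age of users is: ",rounded_mean_age)

# value_counts() sorts by frequency in descending order, so tail() shows the least frequent ages
print(f"\n\n The ages with the least occurrence are:\n{users['age'].value_counts().tail()}")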
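
Several ages can share the lowest frequency, so reading the last rows of `value_counts()` may not show all of them. A minimal sketch, on a made-up list of ages, that keeps every age tied for the minimum count:

```python
import pandas as pd

# Hypothetical ages; 23 and 33 each appear once.
ages = pd.Series([24, 53, 23, 24, 33, 53, 24])

age_counts = ages.value_counts()

# Keep every age whose count equals the smallest count.
least_frequent = sorted(age_counts[age_counts == age_counts.min()].index.tolist())

print(least_frequent)  # [23, 33]
```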

### Part 8
##### Find mean, mode, median, Standard Deviation and Variance of each Numeric column (Not Categorical) in the Dataset (For this use Numpy)

##### Find the Maximum and Minimum from each Numeric column
- Mean: The average value of a dataset, calculated by summing all values and dividing by the number of values.
- Mode: The value that appears most frequently in a dataset.
- Median: The middle value of a dataset when it is ordered from least to greatest.
- Standard Deviation: A measure of the amount of variation or dispersion in a dataset. Standard Deviation is the square root of the variance and gives the spread in the same units as the data, making it easier to interpret.
- Variance: The average of the squared differences from the mean, indicating how much the data deviates from the mean. Variance measures how far each number in the dataset is from the mean, in squared units.
##### Example Dataset: `[10, 12, 14, 16, 18]`
1. **Calculate Mean:**
   \[
   \text{Mean} = \frac{10 + 12 + 14 + 16 + 18}{5} = 14
   \]

2. **Calculate Each Value’s Deviation from the Mean and Square It:**
   - \((10 - 14)^2 = 16\)
   - \((12 - 14)^2 = 4\)
   - \((14 - 14)^2 = 0\)
   - \((16 - 14)^2 = 4\)
   - \((18 - 14)^2 = 16\)

3. **Find the Average of These Squared Deviations (Variance):**
   \[
   \text{Variance} = \frac{16 + 4 + 0 + 4 + 16}{5} = 8
   \]

4. **Calculate Standard Deviation (Square Root of Variance):**
   \[
   \text{Standard Deviation} = \sqrt{8} \approx 2.83
   \]
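
The worked example divides by the number of values (5), which is the population variance. A short NumPy check of the same numbers follows. NumPy's `var` and `std` default to `ddof=0` (divide by n), while pandas' `var()` and `std()` default to `ddof=1` (divide by n - 1), so the two libraries give different answers on the same data unless `ddof` is set explicitly.

```python
import numpy as np

data = np.array([10, 12, 14, 16, 18])

mean = data.mean()
population_variance = data.var()     # ddof=0: divide by n
population_std = data.std()          # square root of the population variance
sample_variance = data.var(ddof=1)   # ddof=1: divide by n - 1

print(mean)                      # 14.0
print(population_variance)       # 8.0
print(round(population_std, 2))  # 2.83
print(sample_variance)           # 10.0
```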


In [None]:
import pandas as pd
import numpy as np

URL="https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
users=pd.read_csv(URL,sep='|',index_col='user_id' ) 
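# Identify numeric columns: describe() summarizes only numeric columns by default,
# so the columns of its result are the numeric column names
numeric_cols = users.describe().columns

# Initialize dictionaries to store results, keyed by column name
# (e.g., mean_dict[col] holds the mean of that column).
# Note: pandas std() and var() use the sample formulas by default (ddof=1, divide by n - 1).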
mean_dict = {}
mode_dict = {}
median_dict = {}
std_dev_dict = {}
variance_dict = {}
max_dict = {}
min_dict = {}

# Calculate statistics for each numeric column
for col in numeric_cols:
    mean_dict[col] = users[col].mean()
    mode_dict[col] = users[col].mode().values[0]
    median_dict[col] = users[col].median()
    std_dev_dict[col] = users[col].std()
    variance_dict[col] = users[col].var()
    max_dict[col] = users[col].max()
    min_dict[col] = users[col].min()
    
   
# Display results
print("Mean:\n", mean_dict)
print("\nMode:\n", mode_dict)
print("\nMedian:\n", median_dict)
print("\nStandard Deviation:\n", std_dev_dict)
print("\nVariance:\n", variance_dict)
print("\nMaximum:\n", max_dict)
print("\nMinimum:\n", min_dict)

Mean:
 {'age': np.float64(34.05196182396607)}

Mode:
 {'age': np.int64(30)}

Median:
 {'age': np.float64(31.0)}

Standard Deviation:
 {'age': np.float64(12.192739733059032)}

Variance:
 {'age': np.float64(148.66290219811643)}

Maximum:
 {'age': np.int64(73)}

Minimum:
 {'age': np.int64(7)}
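
The task statement asks for NumPy, while the cell above calls the pandas methods. A possible NumPy-only version is sketched below on a made-up array of ages (not the real column). NumPy has no mode function, so the mode is derived here from `np.unique` with `return_counts=True`; `ddof=1` is passed to match the pandas defaults used above.

```python
import numpy as np

# Hypothetical ages standing in for users['age'].to_numpy().
ages = np.array([24, 53, 23, 24, 33, 42, 57, 36, 29, 53])

# np.unique returns the sorted distinct values and how often each occurs.
values, counts = np.unique(ages, return_counts=True)

stats = {
    "mean": np.mean(ages),
    "mode": values[np.argmax(counts)],  # smallest value when counts tie
    "median": np.median(ages),
    "std": np.std(ages, ddof=1),        # sample standard deviation
    "variance": np.var(ages, ddof=1),   # sample variance
    "max": np.max(ages),
    "min": np.min(ages),
}

for name, value in stats.items():
    print(f"{name}: {value}")
```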


#### Markdown 2
##### Dot and Bracket Notation
1. Dot Notation (users.occupation)
- Usage: users.occupation is a shorthand notation that allows you to access a DataFrame column as if it were an attribute of the DataFrame.
- Behavior: This method works when the column name is a valid Python identifier (e.g., it doesn’t contain spaces, special characters, or conflict with DataFrame methods).

- Limitation: It cannot be used if the column name is not a valid attribute name (e.g., contains spaces or special characters) or if it conflicts with existing DataFrame methods.

2. Bracket Notation (users['occupation'])
- Usage: users['occupation'] is the more general and flexible method for accessing DataFrame columns. It uses a string to specify the column name.
- Behavior: This method works with any column name, including those with spaces or special characters.
- Advantage: It avoids conflicts with DataFrame methods and allows for dynamic column access.
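
A minimal sketch of both notations on a made-up frame. The column names `zip code` and `count` are chosen only to trigger the two limitations of dot notation described above.

```python
import pandas as pd

# Hypothetical frame: one ordinary name, one with a space, one that
# collides with the DataFrame.count() method.
frame = pd.DataFrame({
    "occupation": ["writer", "other"],
    "zip code": ["85711", "94043"],
    "count": [3, 5],
})

# For a plain identifier both notations return the same column.
same_column = frame.occupation.equals(frame["occupation"])

# A name with a space is reachable only through brackets.
zip_codes = frame["zip code"]

# frame.count is the method, not the column; brackets give the column.
count_attr_is_method = callable(frame.count)
count_column = frame["count"]

print(same_column, count_attr_is_method)
print(count_column.tolist())
```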


- ------------------------------------------------------------------------------------------------------------------------------------------
- ------------------------------------------------------------------------------------------------------------------------------------------

##### Markdown 3
- The users.count() method returns a Series in which each entry is the count of non-null values for one column.
- In my dataset, every column has 943 non-null values, so users.count() reports 943 for each column.
- When I use users.count().max(), I am asking for the largest value in this Series.
- Since all columns have the same number of non-null values (943), max() returns 943.

**Why Use max()?**
- The max() function is used here to extract the highest count of non-null values among all columns.
- This equals the number of rows only when at least one column has no missing values; otherwise it undercounts, so len(users) or users.shape[0] is the safer row count.
- It's useful when we want to:

- Verify Consistency: Compare max() with users.count().min() or len(users). If they match, as in my case, every column has the same number of non-null entries.
- If columns have varying counts, max() shows only the highest count.

**Determine Data Completeness:**
- In some scenarios, we might want to know the maximum number of non-null entries to understand the completeness of our data.
- ------------------Example is explained in head-tail.ipynb file-------------------
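
A small sketch, on a made-up frame with gaps in every column, of how `count().max()` can differ from the true number of rows:

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which each column has at least one missing value.
frame = pd.DataFrame({
    "age": [24, np.nan, 23, 33],
    "occupation": ["writer", "other", None, None],
})

per_column = frame.count()    # non-null entries per column
row_count = len(frame)        # counts rows regardless of missing values
max_count = per_column.max()

print(per_column.to_dict())   # {'age': 3, 'occupation': 2}
print(row_count, max_count)   # 4 3
```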

- ------------------------------------------------------------------------------------------------------------------------------------------
- ------------------------------------------------------------------------------------------------------------------------------------------

#### Markdown 4
- value_counts(): users['occupation'].value_counts() returns a Series whose index holds the unique occupations and whose values are how often each one occurs.

- .count(): Calling .count() on that Series returns its number of entries. Because each unique occupation is one entry, this equals the number of unique occupations.

- nunique(): If you just want to know how many different occupations there are, nunique() is the more straightforward choice.

- value_counts(): If you need to understand the distribution of occupations (e.g., how many people have each occupation), value_counts() provides more detailed information.
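
Both routes can be compared on a small made-up Series of occupations:

```python
import pandas as pd

# Hypothetical occupations, not the real column.
occupations = pd.Series(
    ["student", "writer", "student", "other", "student", "writer"]
)

distinct = occupations.nunique()          # number of unique values
frequencies = occupations.value_counts()  # count per unique value
distinct_via_counts = frequencies.count()

print(distinct, distinct_via_counts)  # 3 3
print(frequencies.to_dict())
```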

- ------------------------------------------------------------------------------------------------------------------------------------------
- ------------------------------------------------------------------------------------------------------------------------------------------

#### Markdown 5
1. users.occupation.value_counts()

- Purpose: Counts the occurrences of each unique value in the occupation column.
- Output: A Series where the index is the unique values (occupations) and the values are the counts. 
- This Series is sorted in descending order by default, so the most frequent occupation comes first.

2. head(1)

- Purpose: Retrieves the first row of the Series, which corresponds to the most frequent value because value_counts() sorts the Series in descending order.
- Output: A Series containing only the most frequent value and its count

3. index[0]

- Purpose: Extracts the index (which is the occupation) from the Series obtained in the previous step.
- Output: The name of the most frequent occupation.

###### Why index[0] and Not Others?
- index[0]: Since .head(1) returns a Series with only one entry (the most frequent occupation), index[0] is the label of that single entry, which is the occupation name.

- Other Indexes: index[1], index[2], etc. raise an IndexError because the result of .head(1) contains only the top row.
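
The chain described above, together with the `idxmax()` alternative mentioned in the Part 5 cell, can be tried on a made-up Series:

```python
import pandas as pd

# Hypothetical occupations, not the real column.
occupations = pd.Series(
    ["student", "writer", "student", "other", "student", "writer"]
)

counts = occupations.value_counts()  # sorted, most frequent first

top_via_head = counts.head(1).index[0]  # label of the first row
top_via_idxmax = counts.idxmax()        # label of the largest count

# head(1) has a single entry, so position 1 does not exist.
try:
    counts.head(1).index[1]
    raised_index_error = False
except IndexError:
    raised_index_error = True

print(top_via_head, top_via_idxmax, raised_index_error)
```

`idxmax()` reads a little more directly, since it does not depend on the sort order of `value_counts()`.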

- ------------------------------------------------------------------------------------------------------------------------------------------
- ------------------------------------------------------------------------------------------------------------------------------------------


### Markdown 6
##### Summarizing a DataFrame or dataset
- Summarizing a DataFrame in pandas means getting a quick and clear overview of the main features and characteristics of your data. Here's what it means in simple terms:

**1. Overview of Structure:**
- **Columns and Data Types:** Knowing what columns (features) are present and what type of data each column holds (like numbers or text).
- **Rows Count:** How many entries or records there are in the dataset.

**2. Quick Peek at Data:**
- **First Few Rows:** Seeing a small sample of the actual data to get a sense of what it looks like.

**3. Basic Statistics:**
- **Numbers Summary:** Getting statistics like average, minimum, and maximum values for numeric columns, which helps understand the range and spread of the numbers.

**4. Check for Missing Data:**
- **Missing Values:** Identifying if there are any gaps or missing information in your dataset.

**5. Unique Values:**
- **Frequency Counts:** Knowing how often certain values appear in a column, especially for categorical data.

**Why Summarize?**
- **Understand Your Data:** Helps you quickly understand what your data looks like and identify any issues or interesting patterns.
- **Prepare for Analysis:** Provides essential insights needed to analyze the data effectively and make informed decisions.

**Example:**
- Imagine you have a spreadsheet with information about students, like their names, ages, and scores. Summarizing this data would involve:
- Checking how many students there are.
- Seeing what kinds of data are in each column (e.g., names are text, ages are numbers).
- Looking at the average age and score of the students.
- Identifying if there are any missing names or scores.
- Knowing how many students got each score (if it's a test).

In pandas, you use methods like `info()`, `describe()`, and `head()` to quickly get this information.
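
The student example can be sketched as follows. The names, ages, and scores are invented for illustration, with one missing name and one missing score added on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical student records with two deliberate gaps.
students = pd.DataFrame({
    "name": ["Ana", "Ben", None, "Dev"],
    "age": [20, 22, 21, 22],
    "score": [88.0, np.nan, 75.0, 88.0],
})

n_students = len(students)                       # how many students
column_types = students.dtypes                   # kind of data per column
preview = students.head(2)                       # quick peek at the rows
numeric_summary = students.describe()            # mean, min, max, ...
missing = students.isna().sum()                  # gaps per column
score_counts = students["score"].value_counts()  # students per score

students.info()  # prints structure, dtypes, and non-null counts
print(numeric_summary)
print(missing)
print(score_counts)
```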

- ------------------------------------------------------------------------------------------------------------------------------------------
- ------------------------------------------------------------------------------------------------------------------------------------------

### Markdown 7
- The describe() method in pandas provides a statistical summary of the DataFrame's numeric columns. It helps you understand the distribution and central tendencies of the data. Here’s what it does:

##### What describe() Provides:
- Count: The number of non-null entries in each numeric column.
- Mean: The average value of the numeric entries.
- Standard Deviation (std): A measure of the dispersion or spread of the data around the mean.
- Min: The smallest value in the column.
- 25% (First Quartile): The value below which 25% of the data falls.
- 50% (Median or Second Quartile): The middle value that divides the data into two equal halves.
- 75% (Third Quartile): The value below which 75% of the data falls.
- Max: The largest value in the column.
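
Each row of `describe()` corresponds to a method that can also be called on its own. A small check on a made-up Series of ages:

```python
import pandas as pd

# Hypothetical ages used only to compare describe() with single methods.
ages = pd.Series([24, 53, 23, 24, 33], name="age")

summary = ages.describe()

print(summary)
print(summary["50%"] == ages.median())        # the 50% row is the median
print(summary["25%"] == ages.quantile(0.25))  # the 25% row is the first quartile
```

The `std` row uses the sample standard deviation, the same value `ages.std()` returns.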


### Markdown 8
### Difference between describe() and info() methods
##### describe() Method:
- Purpose: Provides a statistical summary of numeric columns by default; passing include='all' extends the summary to every column.
- Output: Includes statistics like count, mean, standard deviation, min, max, and quartiles (25%, 50%, 75%).
- Usage: Best for understanding the distribution and central tendencies of numeric data.

##### info() Method:
- Purpose: Provides a summary of all columns, including data types and non-null counts.
- Output: Shows:
  - Column Names: The names of all columns.
  - Data Types: The data type of each column (e.g., int64, float64, object).
  - Non-Null Counts: The number of non-null entries in each column.
  - Memory Usage: The memory usage of the DataFrame.
- Usage: Best for getting an overview of the data types, presence of missing values, and general structure of the DataFrame.
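
A short sketch of the contrast on a made-up two-column frame: `info()` prints its report and returns `None`, while `describe()` returns a DataFrame whose columns depend on the `include` argument.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
frame = pd.DataFrame({
    "age": [24, 53, 23],
    "occupation": ["writer", "other", "writer"],
})

info_result = frame.info()  # prints the summary itself
print(info_result)          # None

default_columns = frame.describe().columns.tolist()
all_columns = frame.describe(include="all").columns.tolist()

print(default_columns)  # ['age']
print(all_columns)      # ['age', 'occupation']
```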