<div style="display: flex; justify-content: center; align-items: center; height: 120px;">
    <div style="border: 4px solid blue; padding: 20px; text-align: center; font-size: 28px; font-weight: bold; color: blue; background-color: #f0f8ff; width: 50%;">
        <strong>Data Gathering</strong>
    </div>
</div>


Data gathering is the first and most crucial step in the data analysis process. It involves collecting relevant data from various sources to ensure a solid foundation for analysis. The quality, accuracy, and completeness of the gathered data significantly impact the overall results of any data-driven project.

## <span style="color:green">Why is Data Gathering Important?</span>
- Ensures that analysis is based on **reliable and relevant information**.
- Helps in **making informed decisions** based on structured data.
- Reduces errors by eliminating **incomplete, outdated, or irrelevant data**.
- Forms the base for **data preprocessing, wrangling, and further analysis**.

## <span style="color:purple">Types of Data</span>
Data can be collected in different formats, and it is broadly categorized into:
1. **Structured Data** - Organized data stored in databases, spreadsheets, or CSV files.
2. **Unstructured Data** - Text, images, videos, and social media content that require preprocessing.
3. **Semi-Structured Data** - Data with some structure but not entirely organized, such as JSON, XML, or logs.

## <span style="color:blue">Sources of Data Gathering</span>
- **Manual Entry:** Collecting data through surveys, interviews, or direct observation.
- **Databases & Data Warehouses:** Extracting structured data from SQL databases or cloud storage.
- **APIs (Application Programming Interfaces):** Fetching data from online services like Twitter, Google Maps, or financial markets.
- **Web Scraping:** Extracting data from websites using tools like BeautifulSoup and Scrapy.
- **Sensors & IoT Devices:** Gathering real-time data from connected devices and machines.

## <span style="color:darkorange">Challenges in Data Gathering</span>
- **Data Inconsistency:** Variations in formats, structures, and missing values.
- **Data Security & Privacy Issues:** Compliance with GDPR, HIPAA, or other regulations.
- **Handling Large Volumes of Data:** Managing **big data** efficiently.
- **Access Restrictions:** Some sources require authentication or permissions.


# <span style="color:blue">Data Gathering</span>

## <span style="color:green">What is Data Analysis?</span>
**<span style="color:darkorange">Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.</span>**

## <span style="color:purple">Data Analysis Process</span>
**<span style="color:red">Asking Right Questions → Data Wrangling → Exploratory Data Analysis → Drawing Conclusions → Communicating Results</span>**

## <span style="color:blue">(Step 1) : Asking Questions</span>
**1. <span style="color:darkcyan">What features will contribute to my analysis?</span>**  
**2. <span style="color:darkcyan">What features are not important for my analysis?</span>**  
**3. <span style="color:darkcyan">Which of the features have a strong correlation?</span>**  
**4. <span style="color:darkcyan">Do I need data preprocessing?</span>**  
**5. <span style="color:darkcyan">What kind of feature manipulation/engineering is required?</span>**  

### <span style="color:orange">How can I ask better questions?</span>
**- Subject Matter Expertise & Experience**  

---

## <span style="color:green">(Step 2) Data Wrangling/Munging</span>
**<span style="color:brown">Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for analytics.</span>**  

 

### <span style="color:blue">Steps in Data Wrangling:</span>
**1. <span style="color:darkblue">Gathering Data</span>**  
**2. <span style="color:darkblue">Assessing Data</span>**  
**3. <span style="color:darkblue">Cleaning Data</span>**  

### <span style="color:purple">(1st) Gathering Data:</span>
- <span style="color:teal">CSV Files</span>  
- <span style="color:teal">API</span>  
- <span style="color:teal">Web Scraping</span>  
- <span style="color:teal">Databases</span>  

### <span style="color:purple">(2nd) Assessing Data:</span>
- Finding the number of rows/columns (`shape`)  
- Data Types of Various Columns (`info()`)  
- Checking for missing values (`info()`)  
- Checking for duplicate Data (`is_unique`)  
- Memory occupied by the dataset (`info`)  
- High-level mathematical overview of the data (`describe`)  

### <span style="color:purple">(3rd) Cleaning Data:</span>
- Handling Missing Data (e.g., mean, median)  
- Removing Duplicates (`drop_duplicates`)  
- Correcting Data Types (`astype`)  

---

## <span style="color:darkgreen">(Step 3) Exploratory Data Analysis (EDA)</span>

### <span style="color:blue">Explore</span>  
### <span style="color:blue">Augment</span>  

#### <span style="color:darkred">(3a) Exploring Data:</span>
- Finding Correlation and Covariance  
- Performing Univariate and Bivariate Analysis  
- Plotting Graphs (Data Visualization)  

#### <span style="color:darkred">(3b) Augmenting Data:</span>
- Removing Outliers (Boxplot)  
- Merging DataFrames  
- Adding new columns  

**<span style="color:darkmagenta">These operations are collectively called Feature Engineering.</span>**  

---

## <span style="color:darkorange">(Step 4) Drawing Conclusions:</span>
- Inferential Statistics  
- Descriptive Statistics  
- Machine Learning  

---

## <span style="color:darkblue">(Step 5) Communicating Results / Data Storytelling</span>
**<span style="color:darkred">In Person, Reports, Blog Posts, PPTs/Slide decks</span>**


In [1]:
import warnings
warnings.filterwarnings('ignore')

# 1.Importing Pandas

In [2]:
import pandas as pd

# 2.Opening a local csv File

In [10]:
df = pd.read_csv('aug_train.csv')
df

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


# 3. Opening a csv file from URL

In [5]:
import requests
from io import StringIO

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
headers={"User-Agent":
"Mozilla/5.0(Macintosh;Intel Mac OSX10.14;rv:66.0)Gecko/20100101 Firefox/66.0"}
req = requests.get(url, headers=headers)
data = StringIO(req.text)

pd.read_csv(data)

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA
...,...,...
189,Paraguay,SOUTH AMERICA
190,Peru,SOUTH AMERICA
191,Suriname,SOUTH AMERICA
192,Uruguay,SOUTH AMERICA


# 4. Sep Parameter

In [7]:
pd.read_csv('movie_titles_metadata.tsv',sep='\t',names=['sno',
                'name','release_year','rating','votes','genres'])

Unnamed: 0,sno,name,release_year,rating,votes,genres
0,m0,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
1,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
2,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
3,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
4,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']
...,...,...,...,...,...,...
612,m612,watchmen,2009,7.8,135229.0,['action' 'crime' 'fantasy' 'mystery' 'sci-fi'...
613,m613,xxx,2002,5.6,53505.0,['action' 'adventure' 'crime']
614,m614,x-men,2000,7.4,122149.0,['action' 'sci-fi']
615,m615,young frankenstein,1974,8.0,57618.0,['comedy' 'sci-fi']


# 5. Index_col parameter

In [8]:
pd.read_csv('aug_train.csv',index_col='enrollee_id')

Unnamed: 0_level_0,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
enrollee_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
7386,city_173,0.878,Male,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


# 6. Header parameter

In [9]:
pd.read_csv('test.csv',header=1)

Unnamed: 0,0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0
1,2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0
2,3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1
3,4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0


# 7. use_cols parameter

In [10]:
pd.read_csv('aug_train.csv',usecols=['enrollee_id',
                        'gender','education_level'])

Unnamed: 0,enrollee_id,gender,education_level
0,8949,Male,Graduate
1,29725,Male,Graduate
2,11561,,Graduate
3,33241,,Graduate
4,666,Male,Masters
...,...,...,...
19153,7386,Male,Graduate
19154,31398,Male,Graduate
19155,24576,Male,Graduate
19156,5756,Male,High School


# 8. Skiprows/nrows Parameter

In [12]:
pd.read_csv('aug_train.csv',nrows=100)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,12081,city_65,0.802,Male,Has relevent experience,Full time course,Graduate,STEM,9,50-99,Pvt Ltd,1,33,0.0
96,7364,city_160,0.920,,No relevent experience,Full time course,High School,,2,100-500,Pvt Ltd,1,142,0.0
97,11184,city_74,0.579,,No relevent experience,Full time course,Graduate,STEM,2,100-500,Pvt Ltd,1,34,0.0
98,7016,city_65,0.802,Male,Has relevent experience,no_enrollment,Graduate,STEM,6,50-99,Pvt Ltd,2,14,1.0


# 9. Encoding parameter

In [13]:
pd.read_csv('zomato.csv',encoding='latin-1')

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,...,Currency,Has Table booking,Has Online delivery,Is delivering now,Switch to order menu,Price range,Aggregate rating,Rating color,Rating text,Votes
0,6317637,Le Petit Souffle,162,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,"French, Japanese, Desserts",...,Botswana Pula(P),Yes,No,No,No,3,4.8,Dark Green,Excellent,314
1,6304287,Izakaya Kikufuji,162,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,Japanese,...,Botswana Pula(P),Yes,No,No,No,3,4.5,Dark Green,Excellent,591
2,6300002,Heat - Edsa Shangri-La,162,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,"Seafood, Asian, Filipino, Indian",...,Botswana Pula(P),Yes,No,No,No,4,4.4,Green,Very Good,270
3,6318506,Ooma,162,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,"Japanese, Sushi",...,Botswana Pula(P),No,No,No,No,4,4.9,Dark Green,Excellent,365
4,6314302,Sambo Kojin,162,Mandaluyong City,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.057508,14.584450,"Japanese, Korean",...,Botswana Pula(P),Yes,No,No,No,4,4.8,Dark Green,Excellent,229
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9546,5915730,NamlÛ± Gurme,208,ÛÁstanbul,"Kemankeô Karamustafa Paôa Mahallesi, RÛ±htÛ±...",Karakí_y,"Karakí_y, ÛÁstanbul",28.977392,41.022793,Turkish,...,Turkish Lira(TL),No,No,No,No,3,4.1,Green,Very Good,788
9547,5908749,Ceviz AÛôacÛ±,208,ÛÁstanbul,"Koôuyolu Mahallesi, Muhittin íìstí_ndaÛô Cadd...",Koôuyolu,"Koôuyolu, ÛÁstanbul",29.041297,41.009847,"World Cuisine, Patisserie, Cafe",...,Turkish Lira(TL),No,No,No,No,3,4.2,Green,Very Good,1034
9548,5915807,Huqqa,208,ÛÁstanbul,"Kuruí_eôme Mahallesi, Muallim Naci Caddesi, N...",Kuruí_eôme,"Kuruí_eôme, ÛÁstanbul",29.034640,41.055817,"Italian, World Cuisine",...,Turkish Lira(TL),No,No,No,No,4,3.7,Yellow,Good,661
9549,5916112,Aôôk Kahve,208,ÛÁstanbul,"Kuruí_eôme Mahallesi, Muallim Naci Caddesi, N...",Kuruí_eôme,"Kuruí_eôme, ÛÁstanbul",29.036019,41.057979,Restaurant Cafe,...,Turkish Lira(TL),No,No,No,No,4,4.0,Green,Very Good,901


# 10. Skip bad lines

In [16]:
pd.read_csv('BX-Books.csv', sep=';', 
            encoding="latin-1", on_bad_lines='skip')

Unnamed: 0,0195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg
0,0002005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
1,0060973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
2,0374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
3,0393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton &amp; Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
4,0399135782,The Kitchen God's Wife,Amy Tan,1991,Putnam Pub Group,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...,http://images.amazon.com/images/P/0399135782.0...
...,...,...,...,...,...,...,...,...
271354,0440400988,There's a Bat in Bunk Five,Paula Danziger,1988,Random House Childrens Pub (Mm),http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...,http://images.amazon.com/images/P/0440400988.0...
271355,0525447644,From One to One Hundred,Teri Sloat,1991,Dutton Books,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...,http://images.amazon.com/images/P/0525447644.0...
271356,006008667X,Lily Dale : The True Story of the Town that Ta...,Christine Wicker,2004,HarperSanFrancisco,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...,http://images.amazon.com/images/P/006008667X.0...
271357,0192126040,Republic (World's Classics),Plato,1996,Oxford University Press,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...,http://images.amazon.com/images/P/0192126040.0...


# 11. dtypes parameter

In [17]:
pd.read_csv('aug_train.csv',dtype={'target':int}).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  int32  
dtypes: float64(1), int32(1), int64(2), obj

# 12. Handling Dates

In [18]:
pd.read_csv('IPL Matches 2008-2020.csv',
            parse_dates=['date']).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   id               816 non-null    int64         
 1   city             803 non-null    object        
 2   date             816 non-null    datetime64[ns]
 3   player_of_match  812 non-null    object        
 4   venue            816 non-null    object        
 5   neutral_venue    816 non-null    int64         
 6   team1            816 non-null    object        
 7   team2            816 non-null    object        
 8   toss_winner      816 non-null    object        
 9   toss_decision    816 non-null    object        
 10  winner           812 non-null    object        
 11  result           812 non-null    object        
 12  result_margin    799 non-null    float64       
 13  eliminator       812 non-null    object        
 14  method           19 non-null     object   

In [20]:
def rename(name):
    if name == "Royal Challengers Bangalore":
        return "RCB"
    else:
        return name

In [21]:
rename("Royal Challengers Bangalore")

'RCB'

# 13. Convertors

In [22]:
pd.read_csv('IPL Matches 2008-2020.csv',converters={'team1':rename})

Unnamed: 0,id,city,date,player_of_match,venue,neutral_venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,eliminator,method,umpire1,umpire2
0,335982,Bangalore,2008-04-18,BB McCullum,M Chinnaswamy Stadium,0,RCB,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,N,,Asad Rauf,RE Koertzen
1,335983,Chandigarh,2008-04-19,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",0,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,N,,MR Benson,SL Shastri
2,335984,Delhi,2008-04-19,MF Maharoof,Feroz Shah Kotla,0,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,N,,Aleem Dar,GA Pratapkumar
3,335985,Mumbai,2008-04-20,MV Boucher,Wankhede Stadium,0,Mumbai Indians,Royal Challengers Bangalore,Mumbai Indians,bat,Royal Challengers Bangalore,wickets,5.0,N,,SJ Davis,DJ Harper
4,335986,Kolkata,2008-04-20,DJ Hussey,Eden Gardens,0,Kolkata Knight Riders,Deccan Chargers,Deccan Chargers,bat,Kolkata Knight Riders,wickets,5.0,N,,BF Bowden,K Hariharan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
811,1216547,Dubai,2020-09-28,AB de Villiers,Dubai International Cricket Stadium,0,RCB,Mumbai Indians,Mumbai Indians,field,Royal Challengers Bangalore,tie,,Y,,Nitin Menon,PR Reiffel
812,1237177,Dubai,2020-11-05,JJ Bumrah,Dubai International Cricket Stadium,0,Mumbai Indians,Delhi Capitals,Delhi Capitals,field,Mumbai Indians,runs,57.0,N,,CB Gaffaney,Nitin Menon
813,1237178,Abu Dhabi,2020-11-06,KS Williamson,Sheikh Zayed Stadium,0,RCB,Sunrisers Hyderabad,Sunrisers Hyderabad,field,Sunrisers Hyderabad,wickets,6.0,N,,PR Reiffel,S Ravi
814,1237180,Abu Dhabi,2020-11-08,MP Stoinis,Sheikh Zayed Stadium,0,Delhi Capitals,Sunrisers Hyderabad,Delhi Capitals,bat,Delhi Capitals,runs,17.0,N,,PR Reiffel,S Ravi


# 14. na_values parameter

In [23]:
pd.read_csv('aug_train.csv',na_values=['Male',])

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,,No relevent experience,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0


# 15. Loading a huge dataset in chunks

In [26]:
dfs = pd.read_csv('aug_train.csv',chunksize=5000)

In [28]:
for chunks in dfs:
    print(chunks.shape)

(5000, 14)
(5000, 14)
(4158, 14)


In [30]:
pd.read_json('https://api.exchangerate-api.com/v4/latest/INR')

Unnamed: 0,provider,WARNING_UPGRADE_TO_V6,terms,base,date,time_last_updated,rates
INR,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2025-02-04,1738627201,1.0000
AED,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2025-02-04,1738627201,0.0422
AFN,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2025-02-04,1738627201,0.8710
ALL,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2025-02-04,1738627201,1.1100
AMD,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2025-02-04,1738627201,4.5600
...,...,...,...,...,...,...,...
XPF,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2025-02-04,1738627201,1.3400
YER,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2025-02-04,1738627201,2.8500
ZAR,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2025-02-04,1738627201,0.2160
ZMW,https://www.exchangerate-api.com,https://www.exchangerate-api.com/docs/free,https://www.exchangerate-api.com/terms,INR,2025-02-04,1738627201,0.3220


# Working with SQL`


```python
!pip install mysql-connector-python
```

In [33]:
import mysql.connector

In [35]:
conn = mysql.connector.connect(host='localhost',
                        user='root',password='',database='')

In [37]:
df = pd.read_sql_query("SELECT * FROM smartphones",conn)

In [38]:
df

Unnamed: 0,brand_name,model,price,rating,has_5g,has_nfc,has_ir_blaster,processor_brand,num_cores,processor_speed,...,refresh_rate,num_rear_cameras,num_front_cameras,os,primary_camera_rear,primary_camera_front,extended_memory_available,extended_upto,resolution_width,resolution_height
0,oneplus,OnePlus 11 5G,54999,89.0,True,True,False,snapdragon,8.0,3.20,...,120,3,1.0,android,50.0,16.0,0,,1440,3216
1,oneplus,OnePlus Nord CE 2 Lite 5G,19989,81.0,True,False,False,snapdragon,8.0,2.20,...,120,3,1.0,android,64.0,16.0,1,1024.0,1080,2412
2,samsung,Samsung Galaxy A14 5G,16499,75.0,True,False,False,exynos,8.0,2.40,...,90,3,1.0,android,50.0,13.0,1,1024.0,1080,2408
3,motorola,Motorola Moto G62 5G,14999,81.0,True,False,False,snapdragon,8.0,2.20,...,120,3,1.0,android,50.0,16.0,1,1024.0,1080,2400
4,realme,Realme 10 Pro Plus,24999,82.0,True,False,False,dimensity,8.0,2.60,...,120,3,1.0,android,108.0,16.0,0,,1080,2412
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
825,oppo,Oppo Find X6,69990,89.0,True,True,False,snapdragon,8.0,3.20,...,120,3,1.0,android,50.0,32.0,0,,1080,2400
826,motorola,Motorola Moto Edge S30 Pro,34990,83.0,True,False,False,snapdragon,8.0,3.00,...,120,3,1.0,android,64.0,16.0,0,,1080,2460
827,honor,Honor X8 5G,14990,75.0,True,False,False,snapdragon,8.0,2.20,...,60,3,1.0,android,48.0,8.0,1,1024.0,720,1600
828,poco,POCO X4 GT 5G (8GB RAM + 256GB),28990,85.0,True,True,True,dimensity,8.0,2.85,...,144,3,1.0,android,64.0,16.0,0,,1080,2460


In [39]:
# pd.read_excel('output.xlsx',index_col='Unnamed: 0')

In [40]:
# pd.read_excel('output.xlsx',sheet_name='Sheet_name_2')

### Reading text files

In [41]:
# pd.read_csv('question_answer_pairs.txt',sep='\t')

## JSON (JavaScript Object Notation) / SQL (Structured Query Language)

**JavaScript Object Notation (JSON)** is a standard text-based format for representing structured data based on JavaScript object syntax. It is commonly used for transmitting data in web applications (e.g., sending some data from the server to the client, so it can be displayed on a web page, or vice versa).

**SQL** stands for Structured Query Language. SQL lets you access and manipulate databases. SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987.


In [42]:
import pandas as pd
pd.read_json("train.json")

Unnamed: 0,id,cuisine,ingredients
0,10259,greek,"[romaine lettuce, black olives, grape tomatoes..."
1,25693,southern_us,"[plain flour, ground pepper, salt, tomatoes, g..."
2,20130,filipino,"[eggs, pepper, salt, mayonaise, cooking oil, g..."
3,22213,indian,"[water, vegetable oil, wheat, salt]"
4,13162,indian,"[black pepper, shallots, cornflour, cayenne pe..."
...,...,...,...
39769,29109,irish,"[light brown sugar, granulated sugar, butter, ..."
39770,11462,italian,"[KRAFT Zesty Italian Dressing, purple onion, b..."
39771,2238,irish,"[eggs, citrus fruit, raisins, sourdough starte..."
39772,41882,chinese,"[boneless chicken skinless thigh, minced garli..."


### Pandas Export
<b> to csv,
    to excel,
    to html,
    to json,
    to sql</b>

## to_csv

In [45]:
df = pd.read_csv("deliveriess.csv")
df.head()

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,...,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
0,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,1,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
1,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,2,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
2,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,3,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,4,0,4,,,
3,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,4,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
4,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,5,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,2,2,,,


In [46]:
temp_df = df.groupby('batsman')['batsman_runs'].sum().reset_index()

In [47]:
temp_df.to_csv('batsman_runs.csv',index=False)

In [48]:
df.pivot_table(index='batsman',columns='bowling_team',
               values='batsman_runs',aggfunc='sum'
              ).to_csv('batsman_vs_team.csv')

## to_excel

In [49]:
df

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,...,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
0,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,1,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
1,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,2,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
2,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,3,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,4,0,4,,,
3,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,4,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
4,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,5,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,2,2,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
179073,11415,2,Chennai Super Kings,Mumbai Indians,20,2,RA Jadeja,SR Watson,SL Malinga,0,...,0,0,0,0,1,0,1,,,
179074,11415,2,Chennai Super Kings,Mumbai Indians,20,3,SR Watson,RA Jadeja,SL Malinga,0,...,0,0,0,0,2,0,2,,,
179075,11415,2,Chennai Super Kings,Mumbai Indians,20,4,SR Watson,RA Jadeja,SL Malinga,0,...,0,0,0,0,1,0,1,SR Watson,run out,KH Pandya
179076,11415,2,Chennai Super Kings,Mumbai Indians,20,5,SN Thakur,RA Jadeja,SL Malinga,0,...,0,0,0,0,2,0,2,,,


In [50]:
temp_df = df.groupby('batsman')['batsman_runs'].sum().reset_index()

In [52]:
temp_df.to_excel("batsman_runs.xlsx")

In [53]:
temp_df.to_excel("output.xlsx",sheet_name='batsman_runs')

In [54]:
temp_df2 = df.pivot_table(index='batsman',
            columns='bowling_team',values='batsman_runs',aggfunc='sum')

# to_html

In [58]:
df.query('batsman_runs == 6').pivot_table(index='over',columns='ball',
                            values='batsman_runs',
                            aggfunc='count').to_html('sixes_heatmap.html')

## to_json

In [59]:
df.groupby(['batting_team','batsman']
          )['batsman_runs'].sum().unstack().to_json('ipl.json')

## to_sql
```python
!pip install pymysql
!pip install sqlalchemy
```

In [89]:
!pip install pymysql
!pip install sqlalchemy

Collecting pymysql
  Downloading PyMySQL-1.1.1-py3-none-any.whl.metadata (4.4 kB)
Downloading PyMySQL-1.1.1-py3-none-any.whl (44 kB)
   ---------------------------------------- 0.0/45.0 kB ? eta -:--:--
   ---------------------------------------- 0.0/45.0 kB ? eta -:--:--
   --------- ------------------------------ 10.2/45.0 kB ? eta -:--:--
   ------------------ --------------------- 20.5/45.0 kB 162.5 kB/s eta 0:00:01
   ------------------ --------------------- 20.5/45.0 kB 162.5 kB/s eta 0:00:01
   ------------------------------------ --- 41.0/45.0 kB 217.9 kB/s eta 0:00:01
   ---------------------------------------- 45.0/45.0 kB 184.7 kB/s eta 0:00:00
Installing collected packages: pymysql
Successfully installed pymysql-1.1.1


In [90]:
import pymysql
from sqlalchemy import create_engine

In [93]:
engine = create_engine("mysql+pymysql://root:000000000000@localhost/something")
# {root}:{password}@{url}/{database}
df.to_sql('ipl_delivery', con = engine, if_exists = 'append')

179078

In [94]:
temp_df.to_sql('batsman_runs', con = engine, if_exists = 'append')

516

In [95]:
six_df = df.query('batsman_runs == 6').pivot_table(index='over',
    columns='ball',values='batsman_runs',aggfunc='count')

In [96]:
six_df.to_sql('six_heatmap', con = engine, if_exists = 'append')

20

## Through API

In [97]:
import pandas as pd
import requests

In [98]:
response = requests.get('https://api.themoviedb.org/3/movie/top_rated?api_key=
yourapikey8&language=en-US&page=1')

In [99]:
temp_df = pd.DataFrame(response.json()['results'])[['id','title','overview',
                'release_date','popularity','vote_average','vote_count']]

In [100]:
df.head()

Unnamed: 0,match_id,inning,batting_team,bowling_team,over,ball,batsman,non_striker,bowler,is_super_over,...,bye_runs,legbye_runs,noball_runs,penalty_runs,batsman_runs,extra_runs,total_runs,player_dismissed,dismissal_kind,fielder
0,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,1,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
1,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,2,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
2,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,3,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,4,0,4,,,
3,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,4,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,0,0,,,
4,1,1,Sunrisers Hyderabad,Royal Challengers Bangalore,1,5,DA Warner,S Dhawan,TS Mills,0,...,0,0,0,0,0,2,2,,,


In [101]:
df = pd.DataFrame()

In [102]:
df

In [104]:
# for i in range(1,429):
#     response = requests.get('https://api.themoviedb.org/3/movie/top_rated?api_key=8265bd1679663a7ea12ac168da84d2e8&language=en-US&page={}'.format(i))
#     temp_df = pd.DataFrame(response.json()['results'])[['id','title','overview','release_date','popularity','vote_average','vote_count']]
#     df = df.append(temp_df,ignore_index=True)

In [105]:
import requests
import pandas as pd
import time

API_KEY = "Yourapikey"
BASE_URL = "https://api.themoviedb.org/3/movie/top_rated"
params = {"api_key": API_KEY, "language": "en-US", "page": 1}

# Get total pages dynamically
response = requests.get(BASE_URL, params=params)
total_pages = response.json().get('total_pages', 1)  # Default to 1 if missing

movie_data = []

for i in range(1, min(total_pages + 1, 429)):  # Limit to 429 pages
    params["page"] = i
    response = requests.get(BASE_URL, params=params)
    
    if response.status_code != 200:
        print(f"Error on page {i}: {response.status_code}")
        continue

    results = response.json().get('results', [])
    for movie in results:
        movie_data.append({
            "id": movie["id"],
            "title": movie["title"],
            "overview": movie["overview"],
            "release_date": movie["release_date"],
            "popularity": movie["popularity"],
            "vote_average": movie["vote_average"],
            "vote_count": movie["vote_count"]
        })

    time.sleep(0.5)  # Prevent hitting rate limits

# Create DataFrame once
df = pd.DataFrame(movie_data)


In [106]:
df

Unnamed: 0,id,title,overview,release_date,popularity,vote_average,vote_count
0,278,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...,1994-09-23,172.502,8.700,27645
1,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",1972-03-14,161.181,8.700,20962
2,240,The Godfather Part II,In the continuing saga of the Corleone crime f...,1974-12-20,94.195,8.570,12650
3,424,Schindler's List,The true story of how businessman Oskar Schind...,1993-12-15,103.983,8.600,16103
4,389,12 Angry Men,The defense and the prosecution have rested an...,1957-04-10,59.573,8.546,8840
...,...,...,...,...,...,...,...
8555,10862,Bounce,"Buddy Amaral, a successful and self-absorbed L...",2000-11-15,8.871,5.800,390
8556,6477,Alvin and the Chipmunks,A struggling songwriter named Dave Seville fin...,2007-12-13,42.350,5.810,4463
8557,451877,I Think We're Alone Now,"After a catastrophe destroys most of humanity,...",2018-09-14,12.446,5.808,401
8558,11519,1941,"In the days after the attack on Pearl Harbor, ...",1979-12-14,13.779,5.800,623


In [107]:
df.to_csv('web_movies.csv')

In [None]:
24b65dcfa9f3dd00294ab050f60d5c01

In [1]:
import requests
import pandas as pd
import time

API_KEY = "24b65dcfa9f3dd00294ab050f60d5c01"
BASE_URL = "https://api.themoviedb.org/3/movie/top_rated"
params = {"api_key": API_KEY, "language": "en-US", "page": 1}

# Get total pages dynamically
response = requests.get(BASE_URL, params=params)
total_pages = response.json().get('total_pages', 1)  # Default to 1 if missing

movie_data = []

for i in range(1, min(total_pages + 1, 429)):  # Limit to 429 pages
    params["page"] = i
    response = requests.get(BASE_URL, params=params)
    
    if response.status_code != 200:
        print(f"Error on page {i}: {response.status_code}")
        continue

    results = response.json().get('results', [])
    for movie in results:
        movie_data.append({
            "id": movie["id"],
            "title": movie["title"],
            "overview": movie["overview"],
            "release_date": movie["release_date"],
            "popularity": movie["popularity"],
            "vote_average": movie["vote_average"],
            "vote_count": movie["vote_count"]
        })

    time.sleep(0.5)  # Prevent hitting rate limits

# Create DataFrame once
df = pd.DataFrame(movie_data)

In [2]:
df

Unnamed: 0,id,title,overview,release_date,popularity,vote_average,vote_count
0,278,The Shawshank Redemption,Imprisoned in the 1940s for the double murder ...,1994-09-23,165.375,8.700,27647
1,238,The Godfather,"Spanning the years 1945 to 1955, a chronicle o...",1972-03-14,152.677,8.688,20965
2,240,The Godfather Part II,In the continuing saga of the Corleone crime f...,1974-12-20,91.379,8.570,12651
3,424,Schindler's List,The true story of how businessman Oskar Schind...,1993-12-15,112.597,8.600,16105
4,389,12 Angry Men,The defense and the prosecution have rested an...,1957-04-10,60.437,8.500,8842
...,...,...,...,...,...,...,...
8555,9384,Starsky & Hutch,"Join uptight David Starsky and laid-back Ken ""...",2004-03-05,13.686,5.810,2327
8556,6477,Alvin and the Chipmunks,A struggling songwriter named Dave Seville fin...,2007-12-13,33.355,5.810,4464
8557,433627,7 Days in Entebbe,"In 1976, four hijackers take over an Air Franc...",2018-03-15,12.156,5.800,417
8558,12118,Police Academy 3: Back in Training,"When police funding is cut, the Governor annou...",1986-03-20,14.797,5.809,1233


In [5]:
df[df['title'] == ]

Unnamed: 0,id,title,overview,release_date,popularity,vote_average,vote_count


In [8]:
df.sample(10)

Unnamed: 0,id,title,overview,release_date,popularity,vote_average,vote_count
6314,10192,Shrek Forever After,A bored and domesticated Shrek pacts with deal...,2010-05-16,96.929,6.378,7419
3251,2064,While You Were Sleeping,A transit worker pulls commuter Peter off rail...,1995-04-21,21.667,7.018,1828
1684,2292,Clerks,Convenience and video store clerks Dante and R...,1994-10-19,9.956,7.4,2513
1274,5143,Hannah and Her Sisters,"Between two Thanksgivings, Hannah's husband fa...",1986-02-07,10.673,7.505,1056
6317,10951,Gorgeous,"When Ah Bu, a girl from a small fishing town i...",1999-02-12,17.673,6.377,370
8544,70706,Very Good Girls,Two New York City girls make a pact to lose th...,2013-01-22,8.684,5.8,431
2543,20345,The Bird with the Crystal Plumage,An American writer living in Rome witnesses an...,1970-02-27,7.601,7.2,767
4764,11371,The Score,An aging thief hopes to retire and live off hi...,2001-07-13,15.281,6.694,1661
4257,796,Cruel Intentions,"Slaking a thirst for dangerous games, Kathryn ...",1999-03-05,59.108,6.801,3292
443,223,Rebecca,Story of a young woman who marries a fascinati...,1940-03-23,15.454,7.891,1778
