## What is a Pandas DataFrame?

A pandas DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes (rows and columns). A DataFrame has three principal components: the data, the rows, and the columns. pandas is built on the NumPy library and is written in Python, Cython, and C.
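
As a minimal sketch of the three components, the snippet below builds a small frame and pulls out its data, row labels, and column labels. The sample values and the row labels `r1` and `r2` are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame(
    {"Name": ["Anil", "Amita"], "Age": [23, 25]},
    index=["r1", "r2"],  # custom row labels
)

data = df.to_numpy()        # the data, as a NumPy array
rows = list(df.index)       # the row labels
columns = list(df.columns)  # the column labels

print(data)
print(rows)
print(columns)
```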

## DataFrame Features

- Supports labeled axes: both rows and columns can be given names.
- Supports heterogeneous data: each column can hold a different type.
- Supports arithmetic operations on rows and columns.
- Reads flat files such as CSV, Excel, and JSON, and also reads SQL tables.
- Handles missing data.
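
Two of these features, column arithmetic and missing-data handling, can be sketched in a few lines. The subject names and marks below are hypothetical, and replacing missing values with 0 is just one possible choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical marks; one value is deliberately missing.
df = pd.DataFrame(
    {"Maths": [78, 89, np.nan], "Science": [80, 70, 90]},
    index=["Anil", "Amita", "Hina"],
)

# Arithmetic on whole columns; a missing operand yields a missing result.
df["Total"] = df["Maths"] + df["Science"]

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Return a copy in which missing values are replaced by 0.
filled = df.fillna(0)

print(df)
print(missing_per_column)
print(filled)
```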

## Pandas DataFrame vs PySpark DataFrame

In very simple words, pandas runs operations on a single machine, whereas PySpark runs on multiple machines. If you are working on a machine learning application with larger datasets, PySpark is often the better choice, because it distributes the work across a cluster and can process such data many times faster than pandas (up to 100x in favorable cases; the actual speedup depends on the data size, the cluster, and the workload).

PySpark is also widely used in the data science and machine learning community, since many popular data science libraries, including NumPy and TensorFlow, are written in Python. PySpark is also chosen for its efficient processing of large datasets, and it has been used by many organizations such as Walmart, Trivago, Sanofi, and Runtastic.

PySpark is the Python API for Apache Spark; it lets you run Python applications using Apache Spark's capabilities. With PySpark, applications can run in parallel on a distributed cluster (multiple nodes) or even on a single node.

Apache Spark is an analytical processing engine for large-scale distributed data processing and machine learning applications.

Spark was originally written in Scala; later, owing to its industry adoption, the PySpark API was released for Python using Py4J. Py4J is a Java library integrated within PySpark that allows Python to dynamically interface with JVM objects. Hence, to run PySpark you need Java installed along with Python and Apache Spark.

Additionally, for development you can use the Anaconda distribution (widely used in the machine learning community), which comes with many useful tools such as the Spyder IDE and Jupyter Notebook for running PySpark applications.

### How to Decide Between Pandas and PySpark

Below are a few considerations that favor PySpark over pandas.

- Your data is huge, grows significantly over the years, and you want to improve processing time.
- You want fault tolerance.
- You need ANSI SQL compatibility.
- You want a choice of language (Spark supports Python, Scala, Java, and R).
- You want machine learning capability.
- You would like to read Parquet, Avro, Hive, Cassandra, Snowflake, etc.
- You want to stream data and process it in real time.

In [None]:
## Installing Pandas

### 1. Install pandas using the pip command

In [None]:
# Install pandas using pip
pip install pandas
# or
pip3 install pandas

In [None]:
### 2. Install pandas using the Anaconda conda command

The Anaconda distribution comes with the conda tool, which is used to install, upgrade, or downgrade most Python and other packages.

In [None]:
# Install pandas using conda
conda install pandas

### Upgrade pandas to the Latest or a Specific Version

To upgrade pandas to the latest or a specific version, use either the pip install command or, if you are using the Anaconda distribution, conda. Before you upgrade, check the currently installed version of pandas, for example with `pip show pandas`.

Below are the statements to upgrade pandas. Depending on how you want to update, use either the pip or the conda statements.



In [None]:
# Using pip to upgrade pandas
pip install --upgrade pandas

# Alternatively you can also try
python -m pip install --upgrade pandas

# Upgrade pandas to a specific version
pip install pandas==specific-higher-version

# Use conda update
conda update pandas

# Upgrade to a specific version
# (conda update does not accept a version spec, so use conda install)
conda install pandas=0.14.0

## Run Pandas Hello World Example

### Run Pandas From the Command Line

If you installed Anaconda, open the Anaconda command line; otherwise, open the Python shell or command prompt. Enter the following lines to get the version of pandas.

In [1]:
!pip show pandas

Name: pandas
Version: 1.4.2
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org
License: BSD-3-Clause
Location: c:\users\administrator.lab-student\anaconda3\lib\site-packages
Requires: numpy, python-dateutil, pytz
Required-by: datashader, holoviews, hvplot, seaborn, statsmodels, text-normalizer, xarray


In [2]:
import pandas
print(pandas.__version__)

1.4.2


In [3]:
!pip show numpy

Name: numpy
Version: 1.21.6
Summary: NumPy is the fundamental package for array computing with Python.
Home-page: https://www.numpy.org
Author: Travis E. Oliphant et al.
Author-email: 
License: BSD
Location: c:\users\administrator.lab-student\anaconda3\lib\site-packages
Requires: 
Required-by: astropy, bkcharts, bokeh, Bottleneck, daal4py, datashader, datashape, gensim, h5py, holoviews, hvplot, imagecodecs, imageio, imbalanced-learn, Keras-Preprocessing, matplotlib, mkl-fft, mkl-random, moviepy, numba, numexpr, opencv-python, opt-einsum, pandas, patsy, pyerfa, PyWavelets, scikit-image, scikit-learn, scipy, seaborn, statsmodels, tables, tensorboard, tensorflow, tifffile, torchvision, xarray, xgboost, yellowbrick


In [4]:
import numpy
numpy.__version__

'1.21.6'

In [8]:
### How do we create a dataframe

#### Creating a dataframe from a dictionary

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name':['Anil','Amita','Hina'],
    'Age': [23,25,20],
    'Course': [np.nan, 'BSc', 'BBA'],  # np.nan marks a missing value (np.NaN was removed in NumPy 2.0)
    'Marks': [78,89,98]
})

print(df)
print()
print(df.isnull().sum())

    Name  Age Course  Marks
0   Anil   23    NaN     78
1  Amita   25    BSc     89
2   Hina   20    BBA     98

Name      0
Age       0
Course    1
Marks     0
dtype: int64


In [23]:
### Creating a dataframe from several Series using zip

ser1 = pd.Series(['Anil','Amita','Hina'])
ser2 = pd.Series([np.nan, 'BSc', 'BBA'])
ser3 = pd.Series([78,89,98])
ser4 = pd.Series([23,25,20])
# zip pairs the Series element by element, so each tuple becomes one row
df = pd.DataFrame(zip(ser1,ser2,ser3,ser4))
df

       0    1   2   3
0   Anil  NaN  78  23
1  Amita  BSc  89  25
2   Hina  BBA  98  20
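
The frame above has the default integer column labels because `zip` yields plain tuples that carry no names. A minimal sketch of one way to label the columns is to pass `columns=` when building the frame. The variable names and the column order below are chosen only for this illustration.

```python
import numpy as np
import pandas as pd

names = ['Anil', 'Amita', 'Hina']
courses = [np.nan, 'BSc', 'BBA']
marks = [78, 89, 98]
ages = [23, 25, 20]

# Each tuple produced by zip becomes one row; columns= supplies the labels.
df_named = pd.DataFrame(
    list(zip(names, courses, marks, ages)),
    columns=['Name', 'Course', 'Marks', 'Age'],
)

print(df_named)
```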


In [33]:
df = pd.DataFrame(zip(pd.Series(['Anil','Amita','Hina']),
                      pd.Series([np.nan, 'BSc', 'BBA'])
                     ))
df

       0    1
0   Anil  NaN
1  Amita  BSc
2   Hina  BBA


In [None]:
Name = ['Anil','Amita','Hina']

In [32]:
# A Series built from a flat list has one element per name, so the frame has three rows
df = pd.DataFrame(pd.Series(list(Name)))
df

       0
0   Anil
1  Amita
2   Hina


In [28]:
type(pd.Series(Name))

pandas.core.series.Series

In [31]:
type(pd.Series(['Anil','Amita','Hina']))

pandas.core.series.Series

In [35]:
df = pd.concat([ser1,ser2], axis=1)
df

       0    1
0   Anil  NaN
1  Amita  BSc
2   Hina  BBA


In [40]:
name = ['Anil','Amita','Hina']
course = [np.nan, 'BSc', 'BBA']

ser1 = pd.Series(name, name='Name')
ser2 = pd.Series(course, name='Course')
df = pd.concat([ser1,ser2],axis=1)
df

    Name Course
0   Anil    NaN
1  Amita    BSc
2   Hina    BBA
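
As an alternative sketch, a dictionary of Series gives the same labeled result: the dictionary keys become the column names, so the Series themselves do not need a `name`. The variable names here are made up for the comparison.

```python
import numpy as np
import pandas as pd

ser_name = pd.Series(['Anil', 'Amita', 'Hina'])
ser_course = pd.Series([np.nan, 'BSc', 'BBA'])

# Dictionary keys become the column labels.
df_from_dict = pd.DataFrame({'Name': ser_name, 'Course': ser_course})

# Equivalent frame built with concat from named Series.
df_from_concat = pd.concat(
    [ser_name.rename('Name'), ser_course.rename('Course')],
    axis=1,
)

# equals() treats missing values in the same position as equal.
same = df_from_dict.equals(df_from_concat)
print(same)
```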


In [45]:
df = pd.read_table('sample1.txt', header=None, names=['SN','Name','Course'])
df

   SN   Name Course
0   0   Anil    NaN
1   1  Amita    BSc
2   2   Hina    BBA
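
The contents of `sample1.txt` are not shown, so the sketch below assumes a tab-separated file with three fields per line and no header row, and feeds equivalent text through `io.StringIO` so that it runs without the file. `read_table` uses a tab as its default separator.

```python
import io
import pandas as pd

# Assumed stand-in for sample1.txt: tab-separated, no header row,
# with an empty last field on the first line.
text = "0\tAnil\t\n1\tAmita\tBSc\n2\tHina\tBBA\n"

df_txt = pd.read_table(
    io.StringIO(text),
    header=None,                    # the file has no header row
    names=['SN', 'Name', 'Course'], # so column names are supplied here
)

print(df_txt)
```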


In [47]:
df = pd.DataFrame({
    'Name':['Anil','Amita','Hina'],
    'Age': [23,25,20],
    'Course': [np.nan, 'BSc', 'BBA'],
    'Marks': [78,89,98]
}, columns=['Name','Age','Course','Marks'])  # columns= selects and orders the columns

print(df)


    Name  Age Course  Marks
0   Anil   23    NaN     78
1  Amita   25    BSc     89
2   Hina   20    BBA     98


In [53]:
### Reading data from Excel

df = pd.read_excel("C:\\Users\\Administrator.LAB-STUDENT\\Desktop\\Superstore.xls")
df

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2,0.00,41.9136
1,2,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3,0.00,219.5820
2,3,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2,0.00,6.8714
3,4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.0310
4,5,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9989,9990,CA-2014-110422,2014-01-21,2014-01-23,Second Class,TB-21400,Tom Boeckenhauer,Consumer,United States,Miami,...,33180,South,FUR-FU-10001889,Furniture,Furnishings,Ultra Door Pull Handle,25.2480,3,0.20,4.1028
9990,9991,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,FUR-FU-10000747,Furniture,Furnishings,Tenex B1-RE Series Chair Mats for Low Pile Car...,91.9600,2,0.00,15.6332
9991,9992,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,TEC-PH-10003645,Technology,Phones,Aastra 57i VoIP phone,258.5760,2,0.20,19.3932
9992,9993,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,OFF-PA-10004041,Office Supplies,Paper,"It's Hot Message Books with Stickers, 2 3/4"" x 5""",29.6000,4,0.00,13.3200


In [52]:
print('this is python\'s book')

this is python's book


In [54]:
df = pd.read_excel(r"C:\Users\Administrator.LAB-STUDENT\Desktop\Superstore.xls")
df

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2,0.00,41.9136
1,2,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3,0.00,219.5820
2,3,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2,0.00,6.8714
3,4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.0310
4,5,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9989,9990,CA-2014-110422,2014-01-21,2014-01-23,Second Class,TB-21400,Tom Boeckenhauer,Consumer,United States,Miami,...,33180,South,FUR-FU-10001889,Furniture,Furnishings,Ultra Door Pull Handle,25.2480,3,0.20,4.1028
9990,9991,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,FUR-FU-10000747,Furniture,Furnishings,Tenex B1-RE Series Chair Mats for Low Pile Car...,91.9600,2,0.00,15.6332
9991,9992,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,TEC-PH-10003645,Technology,Phones,Aastra 57i VoIP phone,258.5760,2,0.20,19.3932
9992,9993,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,OFF-PA-10004041,Office Supplies,Paper,"It's Hot Message Books with Stickers, 2 3/4"" x 5""",29.6000,4,0.00,13.3200


In [55]:
df = pd.read_excel("C:/Users/Administrator.LAB-STUDENT/Desktop/Superstore.xls")
df

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600,2,0.00,41.9136
1,2,CA-2016-152156,2016-11-08,2016-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,...,42420,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400,3,0.00,219.5820
2,3,CA-2016-138688,2016-06-12,2016-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,...,90036,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200,2,0.00,6.8714
3,4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,5,0.45,-383.0310
4,5,US-2015-108966,2015-10-11,2015-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,...,33311,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680,2,0.20,2.5164
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9989,9990,CA-2014-110422,2014-01-21,2014-01-23,Second Class,TB-21400,Tom Boeckenhauer,Consumer,United States,Miami,...,33180,South,FUR-FU-10001889,Furniture,Furnishings,Ultra Door Pull Handle,25.2480,3,0.20,4.1028
9990,9991,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,FUR-FU-10000747,Furniture,Furnishings,Tenex B1-RE Series Chair Mats for Low Pile Car...,91.9600,2,0.00,15.6332
9991,9992,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,TEC-PH-10003645,Technology,Phones,Aastra 57i VoIP phone,258.5760,2,0.20,19.3932
9992,9993,CA-2017-121258,2017-02-26,2017-03-03,Standard Class,DB-13060,Dave Brooks,Consumer,United States,Costa Mesa,...,92627,West,OFF-PA-10004041,Office Supplies,Paper,"It's Hot Message Books with Stickers, 2 3/4"" x 5""",29.6000,4,0.00,13.3200
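
The three `read_excel` calls above differ only in how the Windows path is written: escaped backslashes, a raw string, and forward slashes. A small sketch using only the standard library shows why they refer to the same file. Note that writing the same path with single backslashes in a normal (non-raw) string would not even compile, because `\U` starts a Unicode escape.

```python
from pathlib import PureWindowsPath

escaped = "C:\\Users\\Administrator.LAB-STUDENT\\Desktop\\Superstore.xls"
raw = r"C:\Users\Administrator.LAB-STUDENT\Desktop\Superstore.xls"
forward = "C:/Users/Administrator.LAB-STUDENT/Desktop/Superstore.xls"

# Escaped backslashes and a raw string produce the identical string.
same_string = escaped == raw

# Forward slashes give a different string, but the same Windows path.
same_path = PureWindowsPath(forward) == PureWindowsPath(raw)

print(same_string, same_path)
```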


In [67]:
# Passing a list to sheet_name returns a dict of DataFrames keyed by sheet name
df = pd.read_excel("Superstore.xls", sheet_name=["Orders","People"])
df['People']

              Person   Region
0      Anna Andreadi     West
1        Chuck Magee     East
2     Kelly Williams  Central
3  Cassandra Brandow    South


In [57]:
import os
os.getcwd()

'C:\\Users\\Administrator.LAB-STUDENT\\DATA SCIENCE CONTENTS\\DATA ANALYTICS Contents\\PANDAS'

In [79]:
df = pd.read_csv('sample1.txt', sep='\t', header=None, names=['SN','Name','Course'],
                na_values=['Not available','Nothing'])
df

   SN  Name Course
0   0  Anil    NaN
1   1   NaN    BSc
2   2  Hina    BBA
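
To make the effect of `na_values` easier to see, here is a self-contained sketch. The text below is an assumed stand-in for the file (its real contents are not shown), with the two placeholder strings placed where the missing values should end up.

```python
import io
import pandas as pd

# Assumed file contents: placeholders stand in for missing values.
text = "0\tAnil\tNothing\n1\tNot available\tBSc\n2\tHina\tBBA\n"

df_na = pd.read_csv(
    io.StringIO(text),
    sep='\t',
    header=None,
    names=['SN', 'Name', 'Course'],
    na_values=['Not available', 'Nothing'],  # extra strings to treat as missing
)

# First sum() counts missing values per column, second sum() totals them.
total_missing = int(df_na.isna().sum().sum())

print(df_na)
print(total_missing)
```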


In [75]:
df.isna().sum().sum()

0