<a href="https://colab.research.google.com/github/kamrunsumi/Analysis_Crime_Chicago_Data_using_PYSpark/blob/master/PYSpark(Chicago_Crime).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1><center> PySpark </center></h1>

### Kamrun Sumi
### Dataset: Crimes - 2001 to present - Dashboard 
### Dataset link: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present-Dashboard/5cd6-ry5g
### Google Cloud link: https://storage.googleapis.com/pyspark_assignment/crime_chicago.csv



# Crime in Chicago
Crime in Chicago is a popular topic among data scientists because a large amount of
high-quality data is publicly available to explore.
In this notebook, I explore crime in Chicago with PySpark and try to answer some questions about it.

# Importing data and setting up the PySpark environment
First, I install and import the required libraries and read the data into PySpark.
## Install Java, Spark, and Findspark

This installs Apache Spark 2.4.4, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [None]:
%%bash
apt-get install openjdk-8-jdk-headless -qq > /dev/null
[ ! -e "$(basename spark-2.4.4-bin-hadoop2.7.tgz)" ] && wget http://apache.osuosl.org/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
tar xf spark-2.4.4-bin-hadoop2.7.tgz
pip install -q findspark

## Set Environment Variables
Set the locations where Spark and Java are installed.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

## Start a SparkSession



In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession

# get a spark session. 
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [None]:
from pyspark import SparkConf, SparkContext
import collections

from pyspark.sql.types import  (TimestampType)
from pyspark.sql.functions import format_number
from pyspark.sql.functions import month

### Create Dataframe in Spark!

In [None]:
#crimes = spark.read.csv('file:///Users/manha/Desktop/DSE-6000(S)/Class/Assignment/HW2/Crime_Chicago.csv', 
                  # inferSchema = True,
                   # header = True)



! [ ! -e "$(basename crime_chicago.csv)" ] && wget https://storage.googleapis.com/pyspark_assignment/crime_chicago.csv
crimes= spark.read.csv('crime_chicago.csv',
                     header= True, 
                     inferSchema = True)

print(crimes.columns)

['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type', 'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude', 'Location', 'Historical Wards 2003-2015', 'Zip Codes', 'Community Areas', 'Census Tracts', 'Wards', 'Boundaries - ZIP Codes', 'Police Districts', 'Police Beats']


# Overview of the dataset
### Shape of the dataframe
In this section, I check how many rows and columns the dataset has and what the column types are. I also look at the first five rows of the dataset.

In [None]:
print(" The crimes dataframe has {} records".format(crimes.count()))

print(" The crimes dataframe has {} columns".format(len(crimes.columns)))


 The crimes dataframe has 6963286 records
 The crimes dataframe has 30 columns


In [None]:
crimes.show(5)


+--------+-----------+--------------------+--------------------+----+---------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+--------------------------+---------+---------------+-------------+-----+----------------------+----------------+------------+
|      ID|Case Number|                Date|               Block|IUCR|   Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|Historical Wards 2003-2015|Zip Codes|Community Areas|Census Tracts|Wards|Boundaries - ZIP Codes|Police Districts|Police Beats|
+--------+-----------+--------------------+--------------------+----+---------------+--------------------+--------------------+------+--------+----+--------+----+------------

### Checking datatypes of all the columns

In [None]:

crimes.dtypes

[('ID', 'int'),
 ('Case Number', 'string'),
 ('Date', 'string'),
 ('Block', 'string'),
 ('IUCR', 'string'),
 ('Primary Type', 'string'),
 ('Description', 'string'),
 ('Location Description', 'string'),
 ('Arrest', 'boolean'),
 ('Domestic', 'boolean'),
 ('Beat', 'int'),
 ('District', 'int'),
 ('Ward', 'int'),
 ('Community Area', 'int'),
 ('FBI Code', 'string'),
 ('X Coordinate', 'int'),
 ('Y Coordinate', 'int'),
 ('Year', 'int'),
 ('Updated On', 'string'),
 ('Latitude', 'double'),
 ('Longitude', 'double'),
 ('Location', 'string'),
 ('Historical Wards 2003-2015', 'int'),
 ('Zip Codes', 'int'),
 ('Community Areas', 'int'),
 ('Census Tracts', 'int'),
 ('Wards', 'int'),
 ('Boundaries - ZIP Codes', 'int'),
 ('Police Districts', 'int'),
 ('Police Beats', 'int')]

In [None]:
crimes.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Case Number: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: boolean (nullable = true)
 |-- Domestic: boolean (nullable = true)
 |-- Beat: integer (nullable = true)
 |-- District: integer (nullable = true)
 |-- Ward: integer (nullable = true)
 |-- Community Area: integer (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: integer (nullable = true)
 |-- Y Coordinate: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Updated On: string (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- Historical Wards 2003-2015: integer (nullable = true)
 |-- Zip Codes: integer (nullable = tr

**We can also look at a single column instead of the whole dataset. Below, "select" picks the "Block" column and show(10) prints its first 10 rows.**



In [None]:
crimes.select("Block").show(10)

+--------------------+
|               Block|
+--------------------+
|  012XX W ADDISON ST|
|     005XX E 32ND ST|
|017XX S MICHIGAN AVE|
|   005XX N DRAKE AVE|
|     055XX S WOOD ST|
| 011XX N MONITOR AVE|
|116XX S CARPENTER ST|
|056XX S MARYLAND AVE|
|   033XX S MORGAN ST|
|  033XX N PULASKI RD|
+--------------------+
only showing top 10 rows



# Cleaning step

### Removing rows with null values

In [None]:
crimes=crimes.dropna()
print(crimes.count())

6260600


### Dropping duplicate rows

In [None]:
crimes.drop_duplicates().count()

6260600


### Replacing spaces in column names
Column names with spaces are awkward to reference, so a common practice is to replace the spaces with underscores.
Here I rename all the columns at once with toDF.

In [None]:
crimes = crimes.toDF('ID','Case_Number','Date','Block','IUCR','Primary_Type','Description','Location_Description','Arrest',
             'Domestic','Beat','District','Ward','Community_Area','FBI_Code','X_Coordinate','Y_Coordinate','Year',
             'Updated_On','Latitude','Longitude','Location','Historical_Wards_2003-2015','Zip_Codes',
             'Community_Areas','Census_Tracts','Wards','Boundaries_ZIP_Codes','Police_Districts','Police_Beats')
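
Typing all thirty names by hand is error-prone. As a minimal sketch (not what the notebook actually ran), the new names could instead be derived from the old ones in plain Python. The helper name `to_snake` is made up for illustration, and only a few of the original column names are used as input.

```python
def to_snake(name):
    """Replace spaces with underscores, treating ' - ' as a single separator."""
    return name.replace(" - ", " ").replace(" ", "_")

# A few of the original column names printed by crimes.columns
old_names = ["Case Number", "Primary Type",
             "Historical Wards 2003-2015", "Boundaries - ZIP Codes"]

new_names = [to_snake(n) for n in old_names]
print(new_names)

# With a live DataFrame, the same helper could be applied to every column:
# crimes = crimes.toDF(*[to_snake(c) for c in crimes.columns])
```

Deriving the names this way also keeps later cells consistent, because every reference to a renamed column can go through the same helper.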



### Dropping extra columns I do not need

In [None]:
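# Note: drop() silently ignores names that do not match any column. 'Boundaries_-_ZIP_Codes'
# does not match the renamed 'Boundaries_ZIP_Codes', so that column is kept (see the output below).
crimes = crimes.drop('IUCR','Ward','FBI_Code','Updated_On','Location_Description','Historical_Wards_2003-2015','Zip_Codes',
                            'Community_Areas','Census_Tracts','Wards','Boundaries_-_ZIP_Codes','Police_Beats','Beat')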

print(crimes.columns)

['ID', 'Case_Number', 'Date', 'Block', 'Primary_Type', 'Description', 'Arrest', 'Domestic', 'District', 'Community_Area', 'X_Coordinate', 'Y_Coordinate', 'Year', 'Latitude', 'Longitude', 'Location', 'Boundaries_ZIP_Codes', 'Police_Districts']


### Changing the column names to lower case

In [None]:
for col in crimes.columns:
    crimes = crimes.withColumnRenamed(col, col.lower())
print(crimes.columns)

['id', 'case_number', 'date', 'block', 'primary_type', 'description', 'arrest', 'domestic', 'district', 'community_area', 'x_coordinate', 'y_coordinate', 'year', 'latitude', 'longitude', 'location', 'boundaries_zip_codes', 'police_districts']


### Renaming columns in dataframe 

The code below renames the "primary_type" column to "crime_type".

In [None]:
crimes = crimes.withColumnRenamed("primary_type", "crime_type")
print(crimes.columns)

['id', 'case_number', 'date', 'block', 'crime_type', 'description', 'arrest', 'domestic', 'district', 'community_area', 'x_coordinate', 'y_coordinate', 'year', 'latitude', 'longitude', 'location', 'boundaries_zip_codes', 'police_districts']


### Converting dates to timestamp format 
The Date column is stored as a string. Let's convert it to a timestamp with a user-defined function (UDF).
withColumn creates the new column and drop removes the old one. Having a proper timestamp will help a lot later on.



In [None]:
# Let's check the initial date column
crimes.select("date").show(5, truncate = False)
                           

+----------------------+
|date                  |
+----------------------+
|09/06/2019 11:55:00 PM|
|09/06/2019 11:55:00 PM|
|09/06/2019 11:53:00 PM|
|09/06/2019 11:52:00 PM|
|09/06/2019 11:50:00 PM|
+----------------------+
only showing top 5 rows



In [None]:
# Converting date column to timestamps format. 
from datetime import datetime
from pyspark.sql.functions import col,udf
from pyspark.sql.types import  (TimestampType)
myfunc =  udf(lambda x: datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p'), TimestampType())
crimes = crimes.withColumn('date_time', myfunc(col('date'))).drop("date")

crimes.select(crimes["date_time"]).show(5)

+-------------------+
|          date_time|
+-------------------+
|2019-09-06 23:55:00|
|2019-09-06 23:55:00|
|2019-09-06 23:53:00|
|2019-09-06 23:52:00|
|2019-09-06 23:50:00|
+-------------------+
only showing top 5 rows
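
To see what the format string in the UDF does, here is the same `datetime.strptime` call in plain Python on one of the values shown above. `%I` is the hour on a 12-hour clock and `%p` is the AM/PM marker, which is why 11:55 PM comes out as 23:55.

```python
from datetime import datetime

FMT = "%m/%d/%Y %I:%M:%S %p"   # same format string as in the UDF

parsed = datetime.strptime("09/06/2019 11:55:00 PM", FMT)
print(parsed)  # 2019-09-06 23:55:00
```

Spark also ships a built-in `to_timestamp` function in `pyspark.sql.functions`, which usually avoids the overhead of a Python UDF. The pattern it would need here (something like 'MM/dd/yyyy hh:mm:ss a') is an assumption that was not run against this dataset.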



## Data analysis

### Calculating statistics of numeric and string columns
I will calculate the statistics of string and numeric columns using describe. When we select more than one column, we pass the column names as a Python list.

In [None]:
crimes.select(["crime_type","latitude","longitude","x_coordinate", "y_coordinate","year"]).describe().show()
 


+-------+-----------------+-------------------+-------------------+------------------+------------------+------------------+
|summary|       crime_type|           latitude|          longitude|      x_coordinate|      y_coordinate|              year|
+-------+-----------------+-------------------+-------------------+------------------+------------------+------------------+
|  count|          6260600|            6260600|            6260600|           6260600|           6260600|           6260600|
|   mean|             null|  41.84172226834295| -87.67146301860431|1164612.0075676453|1885614.3465322813|2009.3613969907037|
| stddev|             null|0.08618143754829734|0.05872770724926101|16137.106412858006|31336.436030737987| 4.932564528722894|
|    min|            ARSON|       41.644588224|      -87.934324301|           1092711|           1813896|              2001|
|    max|WEAPONS VIOLATION|       42.022709624|      -87.524529859|           1205116|           1951573|              2019|


**The numbers above are hard to read. Let's round them to two decimals using format_number from PySpark's functions.**



In [None]:
result = crimes.select(["crime_type","latitude","longitude","x_coordinate","y_coordinate","year"]).describe()
result.select(result['summary'],
              format_number(result["crime_type"].cast('float'),2).alias("crime_type"),
              format_number(result["latitude"].cast('float'),2).alias("latitude"),
              format_number(result["longitude"].cast('float'),2).alias("longitude"),
              format_number(result["x_coordinate"].cast('float'),2).alias("x_coordinate"),
              format_number(result["y_coordinate"].cast('float'),2).alias("y_coordinate"),
              format_number(result["year"].cast('float'),2).alias("year")
             ).show()

+-------+------------+------------+------------+------------+------------+------------+
|summary|  crime_type|    latitude|   longitude|x_coordinate|y_coordinate|        year|
+-------+------------+------------+------------+------------+------------+------------+
|  count|6,260,600.00|6,260,600.00|6,260,600.00|6,260,600.00|6,260,600.00|6,260,600.00|
|   mean|        null|       41.84|      -87.67|1,164,612.00|1,885,614.38|    2,009.36|
| stddev|        null|        0.09|        0.06|   16,137.11|   31,336.44|        4.93|
|    min|        null|       41.64|      -87.93|1,092,711.00|1,813,896.00|    2,001.00|
|    max|        null|       42.02|      -87.52|1,205,116.00|1,951,573.00|    2,019.00|
+-------+------------+------------+------------+------------+------------+------------+
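
One side effect of the cast is visible in this table: the mean of y_coordinate was 1885614.3465... before formatting but shows as 1,885,614.38 afterwards. A likely explanation is that Spark's 'float' is a 32-bit type, which cannot hold that many significant digits. The plain-Python sketch below reproduces the effect with `struct`. Casting to 'double' instead should avoid it, although that was not rerun here. The crime_type column shows null for min and max because its string values cannot be cast to a number.

```python
import struct

mean_y = 1885614.3465322813   # mean of y_coordinate reported by describe()

# Round-trip through a 32-bit float, which is what a 'float' cast stores
as_float32 = struct.unpack("f", struct.pack("f", mean_y))[0]

print(format(as_float32, ",.2f"))  # 1,885,614.38
print(format(mean_y, ",.2f"))      # 1,885,614.35
```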



### Using PySpark's functions we can calculate various statistics
#### Calculating average for latitude value and longitude value.


In [None]:
from pyspark.sql.functions import mean
# Latitude
# Note: .alias() is called on the DataFrame here, not on the column, so the
# output header stays avg(latitude).
crimes.select(mean("latitude")).alias("Mean latitude").show()
# We can also use the agg method to calculate the average.
crimes.agg({"latitude":"avg"}).show()

# Longitude
crimes.select(mean("longitude")).alias("Mean longitude").show()
crimes.agg({"longitude":"avg"}).show()

+-----------------+
|    avg(latitude)|
+-----------------+
|41.84172226834295|
+-----------------+

+-----------------+
|    avg(latitude)|
+-----------------+
|41.84172226834295|
+-----------------+

+------------------+
|    avg(longitude)|
+------------------+
|-87.67146301860431|
+------------------+

+------------------+
|    avg(longitude)|
+------------------+
|-87.67146301860431|
+------------------+



#### Calculating maximum and minimum values.


In [None]:
from pyspark.sql.functions import max,min
crimes.select(max("x_coordinate"),min("x_coordinate")).show()
crimes.select(max("y_coordinate"),min("y_coordinate")).show()

+-----------------+-----------------+
|max(x_coordinate)|min(x_coordinate)|
+-----------------+-----------------+
|          1205116|          1092711|
+-----------------+-----------------+

+-----------------+-----------------+
|max(y_coordinate)|min(y_coordinate)|
+-----------------+-----------------+
|          1951573|          1813896|
+-----------------+-----------------+



### Finding the location with maximum and minimum crimes




In [None]:
df=crimes.select('crime_type','year','location','domestic')

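# min/max on a string column compare values alphabetically, not by how often they occur.
min_val = df.agg({'crime_type':'min'}).collect()[0][0]  # alphabetically first crime type
max_val = df.agg({'crime_type':'max'}).collect()[0][0]  # alphabetically last crime type
df.where(df['crime_type']==min_val).show()
df.where(df['crime_type']==max_val).show()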

+----------+----+--------------------+--------+
|crime_type|year|            location|domestic|
+----------+----+--------------------+--------+
|     ARSON|2019|(41.902718047, -8...|   false|
|     ARSON|2019|(41.693044285, -8...|   false|
|     ARSON|2019|(41.87828717, -87...|   false|
|     ARSON|2019|(41.747878602, -8...|   false|
|     ARSON|2019|(41.943217294, -8...|   false|
|     ARSON|2019|(41.777261623, -8...|   false|
|     ARSON|2019|(41.889894019, -8...|   false|
|     ARSON|2019|(41.881863028, -8...|   false|
|     ARSON|2019|(41.80223336, -87...|   false|
|     ARSON|2019|(41.820914589, -8...|   false|
|     ARSON|2019|(41.974208525, -8...|   false|
|     ARSON|2019|(41.739662581, -8...|    true|
|     ARSON|2019|(41.878741684, -8...|   false|
|     ARSON|2019|(41.840255102, -8...|   false|
|     ARSON|2019|(41.782492024, -8...|   false|
|     ARSON|2019|(41.782492024, -8...|   false|
|     ARSON|2019|(41.854358931, -8...|   false|
|     ARSON|2019|(41.897736958, -8...|  
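
Note that taking the minimum or maximum of a string column returns the alphabetically first or last value (ARSON and WEAPONS VIOLATION here), not the least or most frequent one. To find where or what occurs most often, the values have to be counted first. The sketch below shows the difference in plain Python on a tiny made-up list, so the sample values are an assumption and not data from the notebook.

```python
from collections import Counter

# Tiny made-up sample of crime types (not taken from the dataset)
sample = ["THEFT", "BATTERY", "THEFT", "ARSON",
          "THEFT", "BATTERY", "WEAPONS VIOLATION"]

# Alphabetical extremes: what min/max return for strings
print(min(sample), "/", max(sample))

# Frequency-based answer: count each value, then take the largest count
counts = Counter(sample)
most_common_type, most_common_n = counts.most_common(1)[0]
print(most_common_type, most_common_n)
```

In PySpark, the count-based version would presumably be a groupBy on the column of interest followed by count() and an orderBy on the resulting "count" column.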

## Crime_Types

### How many primary crime types are there?
The "distinct" function returns the unique elements from the column.

In [None]:
print(" There are {} types of crimes".format(crimes.select("crime_type").distinct().count()))


 There are 34 types of crimes


#### The code below lists the primary crime types with their counts.

In [None]:
crimes.groupBy("crime_type").count().show()


+--------------------+-------+
|          crime_type|  count|
+--------------------+-------+
|OFFENSE INVOLVING...|  42264|
|            STALKING|   3250|
|PUBLIC PEACE VIOL...|  45307|
|           OBSCENITY|    574|
|NON-CRIMINAL (SUB...|      9|
|               ARSON|  10049|
|            GAMBLING|  13299|
|   CRIMINAL TRESPASS| 180123|
|             ASSAULT| 392527|
|      NON - CRIMINAL|     38|
|LIQUOR LAW VIOLATION|  12107|
| MOTOR VEHICLE THEFT| 285412|
|               THEFT|1323216|
|             BATTERY|1146235|
|             ROBBERY| 236983|
|            HOMICIDE|   9470|
|           RITUALISM|     14|
|    PUBLIC INDECENCY|    156|
| CRIM SEXUAL ASSAULT|  24894|
|   HUMAN TRAFFICKING|     57|
+--------------------+-------+
only showing top 20 rows



### How many thefts are there in Chicago?



In [None]:
print(" There are total {} thefts in Chicago".format(crimes.where(crimes["crime_type"] == "THEFT").count()))
#crimes.where(crimes["crime_type"] == "THEFT").count()

 There are total 1323216 thefts in Chicago


### How many thefts led to an arrest in Chicago?

In [None]:

# arrest is a boolean column, so it can be used directly as a filter condition
crimes.filter((crimes["crime_type"]=="THEFT") & crimes["arrest"]).count()

151962

### How many thefts did not lead to an arrest in Chicago?

In [None]:
# ~ negates the boolean arrest column
crimes.filter((crimes["crime_type"]=="THEFT") & ~crimes["arrest"]).count()

1171254
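
The two counts can be cross-checked against the total number of thefts reported earlier and turned into an arrest rate. A quick arithmetic sketch using only the numbers printed above:

```python
total_thefts = 1323216   # thefts in the cleaned dataset
arrested = 151962        # thefts with an arrest
not_arrested = 1171254   # thefts without an arrest

# The two groups should add up to the total
print(arrested + not_arrested == total_thefts)

# Share of thefts that led to an arrest, in percent
arrest_rate = arrested / total_thefts * 100
print(round(arrest_rate, 2))
```

The groups add up exactly, and roughly 11.5% of the recorded thefts led to an arrest.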

### How many domestic assaults were there?

In [None]:
crimes.filter((crimes["crime_type"] == "ASSAULT") & crimes["domestic"]).count()


86481

### How many sex offenses led to an arrest?


In [None]:
crimes.where((crimes["crime_type"] == "SEX OFFENSE") & crimes["arrest"]).count()

6677

### Percentage of crime_type NARCOTICS 

In [None]:
crimes.filter(crimes.crime_type.rlike("NARCOTICS")).count()/crimes.count() * 100

10.309235536530045

### Sorting the dataset by case number in descending order

In [None]:
df = ['crime_type', 'description', 'arrest', 'domestic','case_number']
crimes.orderBy(crimes['case_number'].desc()).select(df).show(5)

+---------------+-------------------+------+--------+-----------+
|     crime_type|        description|arrest|domestic|case_number|
+---------------+-------------------+------+--------+-----------+
|          THEFT|     $500 AND UNDER| false|   false|  ZZZ199957|
|          THEFT|     $500 AND UNDER| false|   false|   ZZ740108|
|CRIMINAL DAMAGE|CRIMINAL DEFACEMENT| false|   false|   ZZ696090|
|          THEFT|      FROM BUILDING| false|   false|   ZZ591134|
|CRIMINAL DAMAGE|CRIMINAL DEFACEMENT| false|   false|   ZZ572583|
+---------------+-------------------+------+--------+-----------+
only showing top 5 rows



### We can use the limit function to restrict the number of rows we want to retrieve from a dataframe.

In [None]:
crimes.select(df).limit(10). show(truncate = True)


+-----------------+--------------------+------+--------+-----------+
|       crime_type|         description|arrest|domestic|case_number|
+-----------------+--------------------+------+--------+-----------+
|            THEFT|           OVER $500| false|   false|   JC426287|
|          BATTERY|DOMESTIC BATTERY ...| false|    true|   JC423179|
|          ROBBERY|STRONGARM - NO WE...| false|   false|   JC423159|
|  CRIMINAL DAMAGE|          TO VEHICLE| false|   false|   JC423127|
|          ASSAULT|              SIMPLE| false|   false|   JC423169|
|          BATTERY|DOMESTIC BATTERY ...| false|    true|   JC423147|
|  CRIMINAL DAMAGE|          TO VEHICLE| false|    true|   JC423139|
|          BATTERY| AGGRAVATED: HANDGUN| false|   false|   JC423190|
|        NARCOTICS|   POSS: BARBITUATES|  true|   false|   JC423150|
|CRIMINAL TRESPASS|             TO LAND| false|   false|   JC423140|
+-----------------+--------------------+------+--------+-----------+



### Creating new column from the existing column



In [None]:
lat_max = crimes.agg({"latitude" : "max"}).collect()[0][0]

print("The maximum latitude values is {}".format(lat_max))


The maximum latitude values is 42.022709624


Let's subtract each latitude value from the maximum latitude.

In [None]:
df = crimes.withColumn("difference_from_max_lat",lat_max - crimes["latitude"])
df.select(["latitude", "difference_from_max_lat"]).show(5)

+------------+-----------------------+
|    latitude|difference_from_max_lat|
+------------+-----------------------+
| 41.94714472|    0.07556490400000371|
|41.836069707|    0.18663991700000082|
|41.858643312|    0.16406631200000277|
|41.890645892|    0.13206373199999888|
|41.792959315|    0.22975030900000348|
+------------+-----------------------+
only showing top 5 rows



## Time analysis

In [None]:
# show() prints the rows and returns None, so there is nothing useful to assign
crimes.sort(crimes.year.desc()).show(5)

+--------+-----------+--------------------+---------------+--------------------+------+--------+--------+--------------+------------+------------+----+------------+-------------+--------------------+--------------------+----------------+-------------------+
|      id|case_number|               block|     crime_type|         description|arrest|domestic|district|community_area|x_coordinate|y_coordinate|year|    latitude|    longitude|            location|boundaries_zip_codes|police_districts|          date_time|
+--------+-----------+--------------------+---------------+--------------------+------+--------+--------+--------------+------------+------------+----+------------+-------------+--------------------+--------------------+----------------+-------------------+
|11819900|   JC423169|     055XX S WOOD ST|        ASSAULT|              SIMPLE| false|   false|       7|            67|     1165296|     1867844|2019|41.792959315|-87.669413975|(41.792959315, -8...|                  23|      

### How many crimes happened per year?

In [None]:
crimes.groupBy("Year").count().show()


+----+------+
|Year| count|
+----+------+
|2003|469975|
|2007|433870|
|2018|261620|
|2015|256691|
|2006|443606|
|2013|305034|
|2014|272492|
|2019|174665|
|2004|465253|
|2012|333730|
|2009|384149|
|2016|265275|
|2001|  3874|
|2005|448002|
|2010|368291|
|2011|349459|
|2008|418037|
|2017|263001|
|2002|343576|
+----+------+
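
groupBy does not guarantee any row order, which is why the years appear shuffled. The sketch below sorts the counts from the output above in plain Python and checks that they add up to the 6,260,600 rows left after cleaning. In PySpark itself, adding an orderBy on the year column before show() would presumably give the same ordering.

```python
# Counts per year copied from the groupBy output above
year_counts = {
    2003: 469975, 2007: 433870, 2018: 261620, 2015: 256691, 2006: 443606,
    2013: 305034, 2014: 272492, 2019: 174665, 2004: 465253, 2012: 333730,
    2009: 384149, 2016: 265275, 2001: 3874, 2005: 448002, 2010: 368291,
    2011: 349459, 2008: 418037, 2017: 263001, 2002: 343576,
}

# Print the counts in chronological order
for year in sorted(year_counts):
    print(year, year_counts[year])

total = sum(year_counts.values())
peak_year = max(year_counts, key=year_counts.get)
print("total:", total, "peak year:", peak_year)
```

Two values deserve caution. The 2019 figure likely covers only part of the year, since the most recent dates shown earlier are from September 2019. The very low 2001 figure may be a result of dropna() removing early records with missing fields, though that was not verified here.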



###  Number of crimes by month
We can use the month function from PySpark's functions to get the numeric month.

In [None]:

monthcrimes = crimes.withColumn("Month",month("Date_time"))
monthCounts = monthcrimes.select("Month").groupBy("Month").count()
monthCounts.show()

+-----+------+
|Month| count|
+-----+------+
|   12|460614|
|    1|465172|
|    6|574350|
|    3|495075|
|    5|578497|
|    9|541374|
|    4|508047|
|    8|596286|
|    7|602919|
|   10|541577|
|   11|486447|
|    2|410242|
+-----+------+
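
The month numbers are easier to read in calendar order and with names attached. As a small plain-Python sketch using the counts printed above (the `MONTHS` list is just a lookup table written out by hand):

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Counts per month copied from the output above
month_counts = {
    12: 460614, 1: 465172, 6: 574350, 3: 495075, 5: 578497, 9: 541374,
    4: 508047, 8: 596286, 7: 602919, 10: 541577, 11: 486447, 2: 410242,
}

# Print the months in calendar order with their names
for m in sorted(month_counts):
    print(f"{MONTHS[m - 1]:<9} {month_counts[m]:>7}")

busiest = MONTHS[max(month_counts, key=month_counts.get) - 1]
quietest = MONTHS[min(month_counts, key=month_counts.get) - 1]
print("busiest:", busiest, "quietest:", quietest)
```

In these counts the summer months are the busiest and February is the quietest, although February is also the shortest month.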



# Advantages and disadvantages

The advantages and disadvantages of PySpark and Pandas are given below.

1. csv file:
Reading a CSV file with Pandas is easy with read_csv(). Older Spark releases (1.x) needed a separate library, spark-csv, but since Spark 2.0 CSV is supported natively through spark.read.csv(), which is what this notebook uses.

2. Counting:
PysparkDF.count() and pandasDF.count() do not give the same output. The first returns the number of rows, and the second returns the number of non-NA/null observations for each column. Note that Spark DataFrames have no .shape attribute, which is used very often in Pandas; count() and len(df.columns) can be used instead, as in the overview section above.

3. Visualization:
In Pandas, pandasDF.head(5) or pandasDF.tail(5) is typically used to get a tabular view of a DataFrame, and IPython notebooks render the result as a table with continuous borders. In Spark, sparkDF.head(5) returns a list of Row objects, which is hard to read, so sparkDF.show(5) is preferable. Note that in the Spark 2.4 release used here you cannot view the last rows: .tail() does not exist, because it is slow to compute in a distributed environment.

4. Data conversion to Dataframe:
Unless the data is converted to a DataFrame, it is not convenient to work with in PySpark. A plain RDD carries no schema for the data, so the DataFrame becomes essential.

5. Wrangling:
In Pandas, the ‘[ ]’ operator can be used for feature engineering, for example to assign a new column. In PySpark this is not possible, since DataFrames are immutable. You should use .withColumn() instead, which returns a new DataFrame.

6. Online support:	
Online support for RDDs in general is limited. Most of the available support covers Scala and Spark DataFrames.

7. Syntax: 
The syntax of PySpark and Pandas is not as similar as one might expect.

In summary, PySpark can be faster than Pandas on large datasets because the work is distributed, but running a cluster can be expensive. PySpark is better suited to big datasets when a cloud/cluster can be utilized. Pandas, on the other hand, is suitable for smaller datasets that fit in memory, where the code can be run on a personal computer.

