CheatSheets-DataScience

A collection of cheatsheets, tips, and reference guides for several data analysis libraries and frameworks

  • Pandas
  • Pyspark
  • Pypolars
  • Dask
  • DataTables
  • etc

PySpark

  • PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.
  • https://spark.apache.org/docs/latest/api/python/reference/index.html

Installation

pip install pyspark

Load PySpark & Launch Spark UI

from pyspark import SparkContext
# Create a Spark Context
sc = SparkContext(master="local[2]")
sc 

Create Spark Session

from pyspark.sql import SparkSession
# Spark Session used for DF
spark = SparkSession.builder.appName("SampleSparkApp").getOrCreate()

Basics For Data Analysis

Read CSV without header/schema

df = spark.read.csv("data/dataset.csv")

Preview/Show the first 5 rows

df.show(5)

Read CSV with header

df = spark.read.csv("data/dataset.csv",header=True)
df.show(5)

Read CSV with header and Infer Schema

df = spark.read.csv("data/dataset.csv",header=True,inferSchema=True)
df.show(5)

Get First Row

df.first()

Get First 5 Rows using .head

df.head(5)

Check For the column names

df.columns

Check For the Datatypes

df.dtypes

Get the Schema

df.printSchema()

Check the number of rows

df.count()

Check for the number of columns

len(df.columns)

Get the shape (rows,cols)

print(df.count(),len(df.columns))

Descriptive Summary

df.describe().show()

Get Descriptive Summary of A Column

df.describe('age').show()

Selection of Columns

df.select('Col1').show()

Tip: Column selection is irrespective of the case of the column name

df.select('COL1').show()
df.select('col1').show()

Select Multiple columns

df.select('Age','Category').show()

Note on Selection

  • Using bracket notation only returns a Column object (the column name), not the data within the column (see the example below)
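A minimal illustration (assuming the DataFrame has an Age column):

df['Age']                 # returns a Column object, e.g. Column<'Age'>, not the data
df.select('Age').show(5)  # use select(...).show() to display the actual values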

Dot Notation

df.Age

Conditions & Filter

Using filter method

# Method 1: Using Filter
df.filter(df['Age'] == 25).show()
# or
df.filter(df.Age == 25).show()
# Filter on ==, >, <, >=, <= conditions
df = df.filter(df.Age > 25)

Filter using Multiple conditions

  • Parentheses are required around each condition

df = df.filter((df.Age > 25) & (df.Gender == 'M'))

Filter against a list of allowed values

from pyspark.sql.functions import col
df = df.filter(col('first_name').isin([3, 4, 7]))

Sort By Ascending or Descending Order

df.orderBy(df.age.asc())
df.orderBy(df.age.desc())

Using where method

# Where: Method 1
df.where(df['sex'] == 'f').show()
# Filter and show only a selected column
df.where(df['sex'] == 'f').select('Age','Sex','Category').show(5)

Working with Columns

from pyspark.sql import functions as F

# Add a column
df.withColumn('Alb_by_10', df['ALB'] * 10).show()
# Persist the addition
df2 = df.withColumn('Alb_by_10', df['ALB'] * 10)
# Rename a column
df = df.withColumnRenamed('dob', 'date_of_birth')
# Select the columns to keep, optionally renaming some
df = df.select(
    'name',
    'age',
    F.col('dob').alias('date_of_birth'),
)

# Remove columns
df = df.drop('monthly_salary', 'salary_for_year')

# Keep only the columns that also occur in another dataset
df = df.select(*(F.col(c) for c in df2.columns))

# Batch rename/clean columns
for c in df.columns:
    df = df.withColumnRenamed(c, c.lower().replace(' ', '_').replace('-', '_'))

Pypolars

  • PyPolars is an open-source Python data frame library similar to Pandas. PyPolars utilizes all the available CPU cores and hence performs computations faster.
  • Ideally, PyPolars is used when the data is too big for Pandas and too small for Spark
  • https://github.com/pola-rs/polars

Pypolars API

  • Eager API: very similar to Pandas; each operation is executed immediately and the results are produced right away.
  • Lazy API: very similar to Spark; a query plan (map) is built first and only executed when the results are requested, running in parallel across all the cores of the CPU (see the sketch below).
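A minimal sketch contrasting the two modes, reusing the diamonds CSV and the lazy API calls shown in the Lazy Execution section further down (double_price is a hypothetical helper):

import pypolars as pl
import pypolars.lazy as plazy

df = pl.read_csv("data/diamonds.csv")

# Eager API: executes immediately, just like Pandas
eager_result = df[df['price'] > 300]

# Lazy API: build a query plan first; nothing runs until .collect()
def double_price(s: plazy.Series) -> plazy.Series:
    return s.apply(lambda x: x * 2)

plan = df.lazy().with_column(plazy.col('price').map(double_price))
lazy_result = plan.collect()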

Installation

pip install py-polars

Import Package

import pypolars as pl

Reading CSV

df = pl.read_csv("data/diamonds.csv")

Check Type

type(df)
# pypolars.frame.DataFrame

Get First Rows / Head

df.head()

Get Last 10 Rows

df.tail(10)

Check For Shape

df.shape

Check For Datatypes

df.dtypes

[pypolars.datatypes.Float64,
 pypolars.datatypes.Utf8,
 pypolars.datatypes.Utf8,
 pypolars.datatypes.Utf8,
 pypolars.datatypes.Float64,
 pypolars.datatypes.Float64,
 pypolars.datatypes.Int64,
 pypolars.datatypes.Float64,
 pypolars.datatypes.Float64,
 pypolars.datatypes.Float64]

Check For Column Names

df.columns

Get Descriptive Summary

# there is no .describe() hence use pandas
df.to_pandas().describe()

Selection of Columns

# Select Columns By Col Name
# same in pandas:: df['carat']
df['col1']

# Select Columns By Index
df[0]

# Select By Index Position for columns
# same in pandas:: df[0] 
df.select_at_idx(0)

# Select Multiple Columns
df[['col1','col2']]

Selection of Rows

# Select rows 0 to 3 and all columns
# same in pandas:: df.iloc[0:3]
df[0:3]

# Slicing 0 to length of rows
# same in pandas:: df.iloc[0:3]
df.slice(0,3)

# Select from 3 rows of a column
df[0:3,"col1"]

# Select from different rows of a column
df[[0,4,10],"col1"]

# Select from different rows of a multiple columns
df[[0,4,10],["col1","col2"]]

Value Counts & Unique Values

df['col1'].value_counts()

# Get Unique Values
df['cut'].unique()

Filter & Conditions

# Boolean Indexing & Condition & Find 
# Find all rows/datapoint with cut as ideal
df[df['cut'] == "Ideal"]

# Check For Price More than A Value (300)
df[df['price'] > 300]

Applying Functions

# Method 1
df['col1'].apply(round)

# Method 2 Using Lambda
df['col1'].apply(lambda x : round(x))

Using Lazy Execution

import pypolars.lazy as plazy

# Example mapping for the 'cut' column (assumed values from the diamonds dataset)
mydict = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}

# Define your fxn
def cust_mapcut_values(s: plazy.Series) -> plazy.Series:
    return s.apply(lambda x: mydict[x])

# Apply Function
output = df.lazy().with_column(plazy.col('cut').map(cust_mapcut_values))

# Execute and Collect Results
output.collect()

Group By

# same in pandas:: df.groupby('cut')['price'].sum()
df.groupby('cut').select('price').sum()

# Selecting Multiple Columns From Groupby
df.groupby('cut').select(['price','carat']).sum()

# Select Every Column + Fxn
df.groupby(['cut','color']).select_all().first()

Create A Series

# Create A Series
x = pl.Series('X',['A','B','C'])
y = pl.Series('Y',[3,5,4])

Plot with Matplotlib

# Load Data Vis Pkgs
import matplotlib.pyplot as plt
# Bar Chart Using Matplotlib
plt.bar(x,y)

Dask

Dask can parallelise operations just as easily on a single computer as on a server. It automatically detects the number of cores on a machine and intelligently distributes the workload.

Installation

pip install dask
pip install "dask[complete]"
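A minimal sketch of a typical Dask DataFrame workflow (the file path and column names here are assumptions for illustration):

import dask.dataframe as dd

# Lazily read a CSV; the data is split into partitions that can be processed in parallel
df = dd.read_csv("data/dataset.csv")

# Operations build a task graph; nothing runs until .compute()
result = df.groupby("category")["value"].mean().compute()
print(result)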

Data-Forge [Data Analysis with Javascript]

Data Analysis with Javascript - DataForge.js

// import package
const dataForge = require('data-forge');
require('data-forge-fs');

Create A DataFrame

const df = new dataForge.DataFrame({
    columnNames:["id","name","sex","age"],
    rows:[
        [1,"Jesse","male",25],
        [2,"Jane","female",25],
        [3,"Mark","male",20],
        [4,"Peter","male",55],
        [5,"Paula","female",35],

    ]
})
df.toArray()
df.toString()

Using a list of objects

let data = [{ A: 1, B: 10 }, { A: 2, B: 20 }, { A: 3, B: 30 }];
let df2 = new dataForge.DataFrame(data);
df2.toString()

Reading CSV

const df = dataForge.readFileSync('data/sodata.csv').parseCSV();

Display DF

df.toArray()

Display as Rows

df.toRows()

Display as Index and Value Pairs

df.toPairs()

Display as JSON

df.toJSON()

Get the Head in a nice format

df.head(5).toString()

Get tail

df.tail(5).toString()

Get Datatypes

df.detectTypes().toString()

Get Value Frequencies

df.detectValues().toString()

Get Column Names

df.getColumnNames()

Rename a column (renameSeries returns a new DataFrame, so assign it to a variable declared with let rather than const)

let df2 = df.renameSeries({"title":"text"})
df2.getColumnNames()

SELECTION OF ROWS

  • pandas: df.iloc[10]

let row10 = df.at(10)
row10

  • pandas: df.loc[10:20]

df.between(10,20).toString()

Using skip and take

df.skip(10).take(20).toString()

SELECTION OF COLUMNS (same as a Series)

  • pandas: df['col1'] or df.col1

method 1: using getSeries

df.getSeries("tags").toString()

method 2: using deflate

df.deflate(row=>row.tags).toString()

method 3: using subset

df.subset(['tags','title']).toString()

Deleting Columns

df.dropSeries('tags').head(5).toString()

Delete Multiple Columns

df.dropSeries(['title','tags']).head(5).toArray()

Filtering

df.where(row=>row['python'] == 0.0).toString()

Modification/Transform

df.transformSeries({title: value => value.toUpperCase()}).toString()

Method 2

let pyHeight = df.getSeries("python")
let pyHeight2 = pyHeight.select(value => value + 2.5)
pyHeight2.toString()

df.withSeries("Height", pyHeight2).toString()

Groupby

df.groupBy(row => row.sex)
  .select(group => {
      return {
          Sex: group.first().sex,
          Count: group.count()
      };
  })
  .inflate();

Thanks For Your Time

  • Jesus Saves @JCharisTech
  • Jesse E.Agbe(JCharis)

