- Atabey Kaygun (kaygun@itu.edu.tr)
- Lectures: Tuesdays 14:30-17:30 (D106)
Data science sits at the intersection of mathematics, statistics, machine learning, and computer science. It uses the methods and tools of these fields to extract information and uncover meaningful insights from data. This course dives into the mathematical principles that underlie the traditional statistical and machine learning models commonly used in the field. It is tailored for students in the fundamental sciences, and its central aim is to equip them with the skills to use and deploy these algorithms across a wide range of practical applications.
The following books are freely available on the web.
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning.
- M. P. Deisenroth, A. A. Faisal, and C. S. Ong. Mathematics For Machine Learning.
- I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning.
The books I listed are mostly theoretical. But for the computational homeworks you may need the following:
- A. Müller and S. Guido. Introduction to Machine Learning with Python: A Guide for Data Scientists, 1st Edition.
- A. Géron. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.
- J. VanderPlas. Python Data Science Handbook.
Also, there are excellent resources on the web. I would recommend:
- edX
- MIT-X
- Kaggle Courses on Python, Pandas, Visualization, Data cleaning, and GIS Data.
- UCI datasets
- Google dataset explorer
- Registry of open datasets on AWS
- Open MRI, MEG, EEG, iEEG, and ECoG data
- Open physiological datasets
- NCBI datasets
- USGS datasets
- Global climate data
- NASDAQ data
The course is an applied data analysis class. This means it requires a degree of proficiency with the following computational tools, and learning them is your responsibility.
- git and GitHub
- Python programming language (version 3.10 or higher)
- Anaconda or Pip package managers
- Jupyter notebook system
- Markdown markup language
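A quick way to see whether the tools listed above are visible on your machine is sketched below. This is only a rough check under my own assumptions: the function name `check_environment` is made up for illustration, and the script looks for executables named `git`, `pip`, `pip3`, `conda`, and `jupyter` on your `PATH`, which may not match how your system is set up.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 10)):
    """Return a dict mapping each required tool to True (found) or False."""
    return {
        # The course asks for Python 3.10 or higher.
        "python": sys.version_info >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "git": shutil.which("git") is not None,
        "pip or conda": any(
            shutil.which(tool) is not None for tool in ("pip", "pip3", "conda")
        ),
        # Accept either the jupyter launcher or an importable notebook package.
        "jupyter": shutil.which("jupyter") is not None
        or importlib.util.find_spec("notebook") is not None,
    }


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:14s} {'found' if found else 'MISSING'}")
```

A `MISSING` line only means the script could not find the tool under the names it tried. It does not tell you how to fix the installation.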
Installing and maintaining these systems on your machine is your responsibility. I can't help you if something doesn't work. You will need to figure it out on your own. If you can't install these systems on your machine you may try to use an online service:
I will make all course-related announcements on İTÜ's course management system NINOVA. I will post the grades on NINOVA as well. So, do check it regularly.
I receive approximately 50 e-mails per day. So, if you need to contact me, use the subject "MAT388E" in your e-mails. Spend some time structuring your e-mail with grammatically correct sentences in Turkish or in English. Be polite, direct, and concise. State what you need in the first two sentences. Sign your e-mails with your name and student number. If I can't figure out who you are and what you need within 30 seconds of opening your message, I will delete your e-mail with no response. You are hereby warned.
Your performance is going to be judged via 4 homework assignments posted on the course GitHub page and one final project that you need to write from scratch. Sending me your GitHub link is worth 5 points, each homework is worth 10 points, the final project proposal is worth 15 points, and the final project is worth 40 points. The deadlines and weights are listed in the table below.
If you receive 0 on any 2 of the homeworks (missing homeworks are graded as 0), or if your total from the homeworks is less than 35%, you'll get a VF. If your final is less than 25%, or your total is less than 35%, you'll receive an F. Note that the conditions for receiving a VF are both necessary and sufficient, while the conditions for receiving an F are only sufficient. This means you may still get an F with a total higher than 35%, depending on the distribution of the scores.
Assessment | Deadline | Weight |
---|---|---|
GitHub link | February 20 | 5% |
Homework 1 | March 5 | 10% |
Homework 2 | March 26 | 10% |
Final Project Proposal | April 2 | 15% |
Homework 3 | April 16 | 10% |
Homework 4 | May 6 | 10% |
Final Project | May 28 | 40% |
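The VF and F rules can be sketched in a few lines of Python. This is only an illustration under assumptions that are mine, not an official grading script. I read "35% of the homeworks" as 35% of the 40 homework points, "final less than 25%" as the final project scoring below 25% of its 40 points, and "total less than 35%" as fewer than 35 of the 100 points. The function name `failing_grade` and its arguments are hypothetical.

```python
HW_MAX = 40      # 4 homeworks, 10 points each
FINAL_MAX = 40   # final project
TOTAL_MAX = 100  # all rows of the table


def failing_grade(homeworks, github_link, proposal, final_project):
    """Return "VF", "F", or None when the stated rules do not decide the grade.

    homeworks is a list of 4 scores between 0 and 10 (missing homework = 0).
    """
    hw_total = sum(homeworks)
    zeros = sum(1 for score in homeworks if score == 0)

    # VF: a 0 on two (or more) homeworks, or homework total below 35%.
    if zeros >= 2 or 100 * hw_total < 35 * HW_MAX:
        return "VF"

    total = github_link + hw_total + proposal + final_project

    # F: final project below 25%, or overall total below 35%.
    # These conditions are only sufficient, so None does not mean "passed".
    if 100 * final_project < 25 * FINAL_MAX or 100 * total < 35 * TOTAL_MAX:
        return "F"
    return None


print(failing_grade([0, 0, 10, 10], 5, 15, 40))   # two zeros
print(failing_grade([10, 10, 10, 10], 5, 15, 5))  # weak final project
```

Because the F conditions are only sufficient, a return value of `None` means the rules above do not settle the grade. It does not mean a pass.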
There is no make-up for the homeworks. If you miss any of the homework deadlines because of an emergency, contact me as soon as you can to make an arrangement.
I will take written attendance in each lecture. I will use the attendance records for students whose grades are edge cases, to push them up or down.
For the homeworks, you are going to need to open a GitHub account and create a private repository for this class. I am going to pull your homeworks and final project from your GitHub repositories at 11:59PM on each deadline date. You must share your private GitHub repository with my hotmail address: atabey_kaygun@hotmail.com. Then send your name, student number, and the link to your private GitHub repository to my İTÜ address (kaygun@itu.edu.tr). Your deadline is February 20, 11:59PM.
I am going to post the homework assignments on the course GitHub page. You'll need to fill in the answers and push them to your own GitHub repository by the deadline.
The final project is worth 40 points and will be evaluated on your final project notebook. You may work in a team of at most 2 students. Open a separate directory for the project, and in that directory put a Jupyter notebook with
- The title of the project,
- The list of team members (names and student numbers), and
- A detailed project proposal.
The proposal must be at least 1000 words and must contain
- A detailed description of the data set you are going to work with,
- The questions you would like to explore,
- The methods and algorithms you think you are going to need,
- A clear plan for how you are going to get the answers you are looking for from the data.
I will grade your proposals (15 points) and might make adjustments to your data set, your hypothesis, and your approach.
By regulation, I must give a final exam. But in the exam I will only ask you to explain your final project.
You may use large language models (ChatGPT, Claude, Copilot, etc.) to assist you in coding and writing your homeworks. However, you must include a log of your interaction with the LLM you are using. On the other hand, passing off someone else's code or text as your own without proper attribution (including from LLMs) is cheating, or worse yet, theft. Copying code from a source with the variable names changed and without proper attribution is another form of cheating. Cheaters will receive 0 and be reported to the university. In short, don't do it.
The following is a tentative schedule of topics I am going to cover. I may go faster or slower depending on the week. I may even add new subjects, or even drop subjects depending on requests and participation.
- Data Science, Machine Learning, Statistics, Computer Science: Similarities and Differences.
- Crash Course in Python and its Library Ecosystem.
- Data types, data APIs, popular data sources, and how to use them.
- Supervised and unsupervised learning. Cross-validation.
- Clustering vs classification. k-means clustering. k-nearest neighbor classification.
- Regression: OLS, regularization, lasso, elastic net.
- Logistic regression. Decision tree regression.
- Support Vector Machines.
- Hierarchical clustering.
- Density based clustering.
- Entropy and Gini. Decision trees. Random forests.
- Newton-Raphson. Gradient Descent. Perceptron. Neural Networks.
- A taxonomy of neural networks. Applications.